Matches in Nanopublications for { ?s <http://purl.org/spar/c4o/hasContent> ?o ?g. }
- paragraph hasContent "We empirically analyzed the performance of the two crowdsourcing workflows described in Section 4: the first workflow combines LD experts in the Find stage with microtask (lay) workers from MTurk in the Verify stage; the second workflow consists of executing both Find and Verify stages with microtask workers. It is important to highlight that, in the experiments of the Verify stage, workers did not know that the data provided to them was previously classified as problematic." assertion.
- paragraph hasContent "In addition, we executed baseline approaches to detect quality issues which allow us to understand the strengths and limitations of applying crowdsourcing in this scenario. We used RDFUnit [21] for ‘object value’ and ‘datatype’ issues, and implemented a simple baseline for detecting incorrect ‘interlinks’." assertion.
- paragraph hasContent "In our experiments, the assessed triples were extracted from the DBpedia dataset (version 3.9). As described in Section 4.1, the TripleCheckMate tool was used in the contest. For the microtask crowdsourcing approaches, Algorithms 1 and 2 were implemented in Python 2.7.2 to generate the corresponding microtasks for the Find and Verify stages, respectively. Resulting microtasks were submitted as HITs to Amazon Mechanical Turk using the MTurk SDK for Java." assertion.
- section-number hasContent "5.1.1" assertion.
- section-5.1.1-title hasContent "Dataset and Implementation" assertion.
- paragraph hasContent "The goal of our experiments is to detect whether RDF triples are incorrect. Based on this, we define:" assertion.
- paragraph hasContent "To measure the performance of the studied crowdsourcing approaches (contest and microtasks), we report on: i) inter-rater agreement, computed with the Fleiss’ kappa metric, in order to measure the degree of consensus among raters (experts or MTurk workers); ii) precision, to measure the quality of the outcome of each crowd, computed as TP / (TP + FP)." assertion.
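The two metrics above can be sketched in Python (the language the paper's tooling was written in). This is an illustrative sketch, not the authors' code; the `ratings` input format (per-triple category counts) is an assumption:

```python
def precision(tp, fp):
    """Precision as defined in the text: TP / (TP + FP)."""
    return tp / (tp + fp)

def fleiss_kappa(ratings):
    """Fleiss' kappa over a list of per-item category counts.

    `ratings` holds, for each assessed triple, a dict mapping a category
    (e.g. 'correct'/'incorrect') to the number of raters who chose it;
    every item must have the same total number of raters.
    """
    n = sum(ratings[0].values())   # raters per item
    N = len(ratings)               # number of items
    categories = {c for item in ratings for c in item}
    # overall proportion of assignments falling into each category
    p = {c: sum(item.get(c, 0) for item in ratings) / (N * n) for c in categories}
    # observed agreement per item, averaged over all items
    P_bar = sum(
        (sum(v * v for v in item.values()) - n) / (n * (n - 1)) for item in ratings
    ) / N
    # agreement expected by chance
    P_e = sum(v * v for v in p.values())
    return (P_bar - P_e) / (1 - P_e)
```

For example, two triples rated unanimously (five raters each) yield a kappa of 1.0, while a 3-vs-2 split on both yields a negative kappa.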
- section-number hasContent "5.1.2" assertion.
- section-5.1.2-title hasContent "Metrics" assertion.
- paragraph hasContent "Two of the authors of this paper (MA, AZ) generated a gold standard for two samples of the crowdsourced triples. To generate the gold standard, each author independently evaluated the triples. After the individual assessment, they compared their results and resolved conflicts via mutual agreement. The first sample corresponds to the set of triples obtained from the contest and submitted to MTurk. The inter-rater agreement between the authors for this first sample was 0.4523 for object values, 0.5554 for datatypes, and 0.5666 for interlinks. For the second sample, we analyzed a subset of the triples identified as ‘incorrect’ by the crowd in the Find stage. The subset has the same distribution of quality issues and triples as the one assessed in the first sample: 509 triples for object values, 341 for datatypes/language tags, and 223 for interlinks. The inter-rater agreement for this second sample was 0.6363 for object values, 0.8285 for datatypes, and 0.7074 for interlinks. The inter-rater agreement values were calculated using Cohen’s kappa measure [6], designed for measuring agreement between two annotators. Disagreement arose in the object value triples when one of the reviewers marked as correct numerical values that had been rounded up to the next whole number. For example, the length of the course of the “1949 Ulster Grand Prix” was 26.5 km in Wikipedia but rounded up to 27 km in DBpedia. In the case of datatypes, most disagreements concerned considering the datatype “number” as correct for the value of the property “year”. For the links, those containing unrelated content were marked as correct by one of the reviewers because the link existed in the Wikipedia page." assertion.
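Cohen's kappa for the two annotators can be sketched as follows; a minimal illustration, assuming each annotator's judgements are given as a plain list of labels over the same triples:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    n = len(labels_a)
    # observed proportion of items on which the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # agreement expected by chance, from each annotator's label marginals
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

With four triples and one disagreement (e.g. `["c","c","i","i"]` vs `["c","i","i","i"]`) this gives a kappa of 0.5, i.e. moderate agreement.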
- paragraph hasContent "The tools used in our experiments and the results are available online, including the outcome of the contest, the gold standard, and the microtask data (HITs and results)." assertion.
- section-number hasContent "5.1.3" assertion.
- section-5.1.3-title hasContent "Gold Standard" assertion.
- section-number hasContent "5.1" assertion.
- section-5.1-title hasContent "Experimental Settings" assertion.
- paragraph hasContent "Participant expertise: We relied on the expertise of members of the Linked Data and the DBpedia communities who were willing to take part in the contest." assertion.
- paragraph hasContent "Task complexity: In the contest, each participant was assigned the concise bound description of a DBpedia resource. All triples belonging to that resource were displayed and the participants had to validate each triple individually for quality problems. Moreover, when a problem was detected, the participant had to map it to one of the problem types from a quality problem taxonomy." assertion.
- paragraph hasContent "Monetary reward: We awarded the participant who evaluated the highest number of resources a Samsung Galaxy Tab 2 worth 300 EUR." assertion.
- paragraph hasContent "Assignments: Each resource was evaluated by at most two different participants." assertion.
- section-number hasContent "5.2.1" assertion.
- section-5.2.1-title hasContent "Contest Settings: Find Stage" assertion.
- paragraph hasContent "Worker qualification: In MTurk, the requester can filter workers according to different qualification metrics. In this experiment, we recruited workers with “Approval Rate” greater than 50%." assertion.
- paragraph hasContent "HIT granularity: In each HIT, we asked the workers to solve five different questions (β = 5). Each question corresponds to an RDF triple and each HIT contains triples classified into one of the three quality issue categories discussed earlier." assertion.
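The HIT granularity described above amounts to partitioning each category's triples into fixed-size batches; a trivial sketch (the function name and input shape are assumptions, not the paper's implementation):

```python
def build_hits(triples, beta=5):
    """Partition one quality-issue category's triples into HITs of
    `beta` questions each; the last HIT may hold fewer triples."""
    return [triples[i:i + beta] for i in range(0, len(triples), beta)]
```

For instance, 12 triples with β = 5 yield three HITs of sizes 5, 5, and 2.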
- paragraph hasContent "Monetary reward: The micropayments were fixed at 4 US dollar cents. Considering the HIT granularity, we paid US$ 0.04 per five triples." assertion.
- paragraph hasContent "Assignments: The number of assignments was set to five and the final answer was selected by applying majority voting. We additionally compared the quality achieved by a group of workers vs. the quality obtained from the worker who submitted the first answer, in order to test whether paying for more answers (assignments) actually increases the quality of the results." assertion.
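The two aggregation settings compared above can be sketched directly; an illustrative snippet, assuming a triple's answers arrive as a list ordered by submission time:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate the (up to five) worker answers for a triple by
    keeping the most frequent label."""
    return Counter(answers).most_common(1)[0][0]

def first_answer(answers):
    """Alternative setting: keep only the earliest submitted answer."""
    return answers[0]
```

With answers `["incorrect", "correct", "incorrect", "incorrect", "correct"]`, majority voting returns `"incorrect"`; the first-answer setting trusts whichever worker happened to respond first.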
- section-number hasContent "5.2.2" assertion.
- section-5.2.2-title hasContent "Microtask Settings: Verify Stage" assertion.
- table hasContent "Table 3" assertion.
- paragraph hasContent "The contest was open for a predefined period of three weeks. During this time, 58 LD experts analyzed 521 distinct DBpedia resources and, considering an average of 47.19 triples per resource in this dataset [43], the experts browsed around 24,560 triples. They detected a total of 1,512 triples as erroneous and classified them using the given taxonomy. After obtaining the results from the experts, we filtered out duplicates, triples whose objects were broken links, and external pages referring to the DBpedia Flickr Wrapper. In total, we submitted 1,073 triples to the crowd. A total of 80 distinct workers assessed all the RDF triples in four days. A summary of these observations is shown in Table 3." assertion.
- table hasContent "Table 4" assertion.
- paragraph hasContent "We compared the common 1,073 triples assessed in each crowdsourcing approach against our gold standard and measured precision as well as inter-rater agreement values for each type of task (see Table 4). For the contest-based approach, the tool allowed two participants to evaluate a single resource. In total, there were 268 inter-evaluations, for which we calculated the triple-based inter-rater agreement (adjusting the observed agreement with agreement by chance) to be 0.38. For the microtasks, we measured the inter-rater agreement values between a maximum of 5 workers for each type of task using Fleiss’ kappa measure [10]. While the inter-rater agreement between workers for the interlinking tasks was high (0.7396), the values for object values and datatypes were moderate to low, with 0.5348 and 0.4960, respectively. Table 4 reports on the precision achieved by the LD experts and the crowd in each stage. In the following we present further details on the results for each type of task." assertion.
- section-number hasContent "5.2.3" assertion.
- section-5.2.3-title hasContent "Overall Results" assertion.
- paragraph hasContent "As reported in Table 4, our crowdsourcing experiments reached a precision of 0.90 for MTurk workers (majority voting) and 0.72 for LD experts. Most of the missing or incomplete values extracted from Wikipedia occur with predicates related to dates, for example: (2005 Six Nations Championship, Date, 12). In these cases, the experts and workers presented similar behavior, classifying 110 and 107 triples correctly, respectively, out of the 117 assessed triples for this class. The difference in precision between the two approaches can be explained as follows. There were 52 DBpedia triples whose values might seem erroneous, although they were correctly extracted from Wikipedia. One example of these triples is: (English (programming language), Influenced by, ?). We found out that the LD experts classified all of these triples as incorrect. In contrast, the workers successfully answered that 50 out of these 52 triples were correct, since they could easily compare the DBpedia and Wikipedia values in the HITs." assertion.
- section-number hasContent "5.2.4" assertion.
- section-5.2.4-title hasContent "Results: Incorrect/missing Values" assertion.
- paragraph hasContent "Table 4 shows that the experts are reliable (with 0.83 precision) at finding this type of quality issue, while the precision of the crowd (0.51) at verifying these triples is relatively low. In particular, the first answers submitted by the crowd were slightly better than the results obtained with majority voting. A detailed study of these cases showed that 28 triples that were initially classified correctly were later misclassified, and most of these triples refer to a language datatype. The low performance of the MTurk workers compared to the experts is not surprising, since this particular task requires certain technical knowledge about datatypes and, moreover, about the specification of values and types in LD." assertion.
- paragraph hasContent "In order to understand the previous results, we analyzed the performance of experts and workers at a more fine-grained level. We calculated the frequency of occurrences of datatypes in the assessed triples (see Figure 6a) and report the number of true positives (TP) and false positives (FP) achieved by both crowdsourcing methods for each type of task. Figure 6b depicts these results. The most notable result in this task is the assessment performance for the datatype “number”. The experts effectively identified triples where the datatype was incorrectly assigned as “number”, whereas the crowd was confused and determined that the datatype was correct, thus generating a large number of false positives. Nevertheless, it could be argued that the datatype “number” is not completely incorrect for a reader unaware of the fact that there are more specific datatypes for representing time units. Under this assumption, the precision of the crowd would have been 0.8475 and 0.8211 for first answer and majority voting, respectively." assertion.
- paragraph hasContent "While looking at the language-tagged strings in “English” (in RDF, @en), Figure 6b shows that the experts perform very well when discerning whether a given value is an English text or not. The crowd was less successful in the following two situations: (i) the value corresponded to a number and the remaining data was specified in English, e.g., (St. Louis School Hong Kong, founded, 1864); and (ii) the value was a text without special characters, but in a language other than English, for example German (Woellersdorf-Steinabrueckl, Art, Marktgemeinde). The performance of both crowdsourcing approaches for the remaining datatypes was similar or not relevant due to the low number of triples processed." assertion.
- section-number hasContent "5.2.5" assertion.
- section-5.2.5-title hasContent "Results: Incorrect Datatypes or Language Tags" assertion.
- paragraph hasContent "Table 4 displays the precision for each studied quality assessment mechanism. The extremely low precision of 0.15 of the contest’s participants was unexpected. We analyzed in detail the 189 misclassifications of the experts:" assertion.
- paragraph hasContent "The two settings of the MTurk workers outperformed the baseline approach. The ‘first answer’ setting reports a precision of 0.62, while ‘majority voting’ achieved a precision of 0.94. The 6% of the links that were not properly classified by the crowd corresponds to web pages whose content is in a language other than English or whose association with the subject is not straightforward, despite being referenced from the Wikipedia article of the subject. Examples of these cases are the following subjects and links: ‘Frank Stanford’ and http://nwar.com/drakefield/ , ‘Forever Green’ and http://www.stirrupcup.co.uk . We hypothesize that the design of the user interface of the HITs – displaying a preview of the web pages to analyze – helped the workers to easily identify those links containing content related to the triple subject." assertion.
- section-number hasContent "5.2.6" assertion.
- section-5.2.6-title hasContent "Results: Incorrect Links" assertion.
- section-number hasContent "5.2" assertion.
- section-5.2-title hasContent "Evaluation of Combining LD Experts (Find Stage) and Microtasks (Verify Stage)" assertion.
- paragraph hasContent "The microtasks crowdsourced in the Find stage were configured as follows:" assertion.
- paragraph hasContent "All triples identified as erroneous by at least two workers in the Find stage were candidates for crowdsourcing in the Verify stage. The microtasks generated in this subsequent stage were crowdsourced with the exact same configuration used in the Verify stage of the first workflow (cf. Section 5.2.2)." assertion.
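The selection of Verify-stage candidates described above can be sketched as a simple filter; an illustrative snippet in which the input shape (one `(triple, worker_id)` pair per 'erroneous' judgement) is an assumption:

```python
from collections import defaultdict

def verify_candidates(find_answers, min_workers=2):
    """Select triples marked as erroneous by at least `min_workers`
    distinct workers in the Find stage."""
    flags = defaultdict(set)
    for triple, worker in find_answers:
        flags[triple].add(worker)  # a set, so repeat votes by one worker count once
    return [t for t, workers in flags.items() if len(workers) >= min_workers]
```

Note that counting distinct workers (a set per triple) rather than raw votes prevents a single worker's repeated judgement from promoting a triple to the Verify stage.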
- section-number hasContent "5.3.1" assertion.
- section-5.3.1-title hasContent "Microtask Settings: Find and Verify Stages" assertion.
- table hasContent "Table 5" assertion.
- paragraph hasContent "In order to replicate the approach followed in the contest, in the Find stage we crowdsourced all the triples associated with resources that were explored by the LD experts. In total, we submitted 30,658 RDF triples to the crowd. The microtasks were resolved by 187 distinct workers who identified 26,835 triples as erroneous in 14 days, and classified them into the three quality issues studied in this work. Then, we selected samples from the triples identified as erroneous in the Find stage by at least two workers from the crowd. This allowed us to fairly compare the outcome of the Verify stage from both workflows. Each sample contains the exact same number of triples that were crowdsourced in the Verify stage of the first workflow, i.e., 509 triples with object value issues, 341 with datatype or language tag issues, and 223 with interlink issues. All triples crowdsourced in the Verify stage were assessed by 141 workers in seven days. A summary of these results and further details are presented in Table 5." assertion.
- paragraph hasContent "Similar to the previous experiment, we measured the inter-rater agreement achieved by the crowd in both stages using the Fleiss’ kappa metric. In the Find stage the inter-rater agreement of workers was 0.2695, while in the Verify stage the crowd achieved substantial agreement for all the types of tasks: 0.6300 for object values, 0.7957 for datatypes or language tags, and 0.7156 for interlinks. In comparison to the first workflow, the crowd in the Verify stage achieved higher agreement. This suggests that the triples identified as erroneous in the Find stage were easier for the crowd to interpret or process. Table 6 reports on the precision achieved by the crowd in each stage. It is important to notice that in this workflow we crowdsourced all the triples that could have been explored by the LD experts in the contest. In this way, we evaluate the performance of lay users and experts under similar conditions. During the Find stage, the crowd achieved low precision values for the three types of tasks, which suggests that this stage is still very challenging for lay users. In the following we present further details on the results for each type of task." assertion.
- section-number hasContent "5.3.2" assertion.
- section-5.3.2-title hasContent "Overall Results" assertion.
- paragraph hasContent "In the Find stage, the crowd achieved a precision of 0.3713 for identifying ‘incorrect/missing values’, as reported in Table 6. In the following we present relevant observations derived from this evaluation:" assertion.
- paragraph hasContent "The crowd in the Verify stage achieved similar precision for both settings, ‘first answer’ and ‘majority voting’, with values of 0.4980 and 0.5072, respectively. Errors from the first iteration were reduced in the Verify stage, especially in triples with the predicates dbpedia-prop:dateOfBirth and dbpedia-prop:placeOfBirth ; 38 out of 46 of these triples were correctly classified in the Verify stage. Workers in this stage still made errors similar to the ones previously discussed – triples encoding DBpedia metadata and geo-coordinates, and incomprehensible predicates – although on a smaller scale in comparison to the Find stage." assertion.
- section-number hasContent "5.3.3" assertion.
- section-5.3.3-title hasContent "Results: Incorrect/missing Values" assertion.
- figure hasContent "Figure 7" assertion.
- paragraph hasContent "In this type of task, the crowd in the Find stage focused on assessing triples whose objects correspond to language-tagged literals. Figure 7a shows the distribution of the datatypes and language tags in the sampled triples processed by the crowd. Out of the 341 analyzed triples, 307 triples identified as ‘erroneous’ in this stage were annotated with language tags. As reported in Table 6, the crowd in the Find stage achieved a precision of 0.1466, the lowest precision achieved in all the microtask settings. Many of the triples (72 out of 341) identified as ‘incorrect’ in this stage were annotated with the English language tag. We corroborated that the false positives in other languages were not generated due to malfunctions of the interface of the HITs: microtasks were properly displaying the non-UTF-8 characters used in several languages in DBpedia, e.g., Russian, Japanese, and Chinese, among others." assertion.
- paragraph hasContent "In the Verify stage of this type of task, the crowd outperformed the precision of the Find stage, achieving values of 0.5510 for the ‘first answer’ setting and 0.8723 with ‘majority voting’. This major improvement in precision highlights the importance of having a multi-validation pattern like Find-Fix-Verify, in which initial errors can be reduced in subsequent iterations. Consistent with the behavior observed in the first workflow, MTurk workers perform well when verifying language-tagged literals. Furthermore, the high values of inter-rater agreement confirm that the crowd is consistently good in this particular scenario. Figure 7b depicts the results of the ‘majority voting’ setting when classifying triples correctly, i.e., true positives (TP) and true negatives (TN), vs. misclassifying triples, i.e., false positives (FP) and false negatives (FN). We can observe that the crowd is exceptionally successful at identifying correct triples that were classified as erroneous in the previous stage (true negatives). This is confirmed by the high accuracy (0.9531) achieved by the crowd in this stage with ‘majority voting’. A closer inspection of the six false positives revealed that in three cases the crowd misclassified triples whose object is a proper noun, for instance, (Tiszaszentimre, name, Tiszaszentimre@en) and (Ferrari Mythos, label, Ferrari Mythos@de) ; in the other three cases the object of the triple corresponds to a common noun or text in the following languages: Italian, Portuguese, and English, for example, (Book, label, Libro@it) ." assertion.
- section-number hasContent "5.3.4" assertion.
- section-5.3.4-title hasContent "Results: Incorrect Datatypes or Language Tags" assertion.
- paragraph hasContent "From the studied sample, in the Find stage the crowd classified as ‘incorrect interlinks’ those RDF triples whose objects correspond to RDF resources (and not web pages); this is the case for the majority of the triples. We analyzed in detail the characteristics of the 169 triples misclassified by the crowd in this stage:" assertion.
- paragraph hasContent "In the Verify stage, the crowd achieved similar values of precision in both settings, ‘first answer’ and ‘majority voting’. Furthermore, in this stage the crowd achieved higher precision (0.5291 for ‘majority voting’) than in the Find stage. From the 167 RDF triples with the predicate rdf:type, the crowd correctly classified 67 triples. Although the false positives were reduced in the Verify stage, the number of misclassified triples with RDF resources as objects is still high. Since the value of inter-rater agreement for this type of task is high, we can deduce that false positives are not necessarily generated by chance; rather, the crowd recurrently confirms that these RDF triples are incorrect. These results suggest that assessing triples with RDF resources as objects without a proper rendering (human-readable information) is challenging for the crowd. Regarding the triples whose objects are external web pages, in the Find stage the crowd correctly classified 35 out of the 36 triples, which is consistent with the behavior observed for this type of triples assessed in the Verify stage of the first workflow." assertion.
- section-number hasContent "5.3.5" assertion.
- section-5.3.5-title hasContent "Results: Incorrect Links" assertion.
- section-number hasContent "5.3" assertion.
- section-5.3-title hasContent "Evaluation of Using Microtask Crowdsourcing in Find and Verify Stages" assertion.
- paragraph hasContent "We took the same set of resources from DBpedia that were assigned to the LD experts and applied baseline approaches for each type of quality issue. The results are discussed in this section." assertion.
- paragraph hasContent "We use the Test-Driven quality assessment (TDQA) methodology [21] as our main baseline approach to detect incorrect object values, datatypes, and literals. TDQA is inspired by test-driven development and proposes a methodology to define (i) automatic, (ii) semi-automatic, and (iii) manual test cases based on SPARQL. Automatic test cases are generated based on schema constraints. The methodology suggests the use of semi-automatic schema enrichment that, in turn, will generate more automatic test cases. Manual test cases are written by domain experts and can be based either on a test case pattern library or written as manual SPARQL queries." assertion.
- paragraph hasContent "RDFUnit [20] is a tool that implements the TDQA methodology. RDFUnit generates automatic test cases for all enabled schemata and checks for common axiom violations. At the time of writing, RDFUnit supports the detection of inconsistencies for domain and range for RDFS, and for cardinality, disjointness, functionality, symmetry, and reflexivity for OWL under the Closed World Assumption (CWA). We reused the same setup for DBpedia used by Kontokostas et al. [21], but excluded 830 test cases that were automatically generated for rdfs:range. The dataset was checked against the following schemata (namespaces): dbpedia-owl, foaf, dcterms, dc, skos, and geo. In addition, we reused the axioms produced by the ontology enrichment step for DBpedia, as described by Kontokostas et al. [21]." assertion.
- paragraph hasContent "In total, 5,146 tests were run; in particular, 3,376 were automatically generated from the tested vocabularies or ontologies, 1,723 from the enrichment step, and 47 were defined manually. Of the 5,146 total test cases, only 138 failed and returned a total of 765 individual validation errors. Table 7 aggregates the test case results and violation instances based on the generation type. Although the enrichment-based test cases were generated automatically, we distinguish them from the automatic test cases that were based on the original schema." assertion.
- paragraph hasContent "In Table 8, we aggregate the failed test cases and the total instance violations based on the patterns from which the test cases were derived. Most of the errors originate from ontological constraints such as functionality, datatype, and domain violations. Common violation instances of ontological constraints are multiple birth/death dates and population values, a datatype of xsd:integer instead of xsd:nonNegativeInteger, and various rdfs:domain violations. In addition to ontological constraints, manual constraints resulted in violation instances such as: a birth date after the death date (1), (probably) invalid postal codes (13), persons without a birth date (51), persons with a death date that should also have a birth date (3), resources with coordinates that should be a dbpedia-owl:Place (16), and dbpedia-owl:Place resources that should have coordinates (7)." assertion.
- paragraph hasContent "In addition to the violation instances in Table 8, there exist 51 additional violation instances originating from a test case that was written as a manual SPARQL query and checks whether a person’s height is between 0.4 and 2.5 meters. In this specific case, the base unit is meters but the values are extracted as centimeters. Thus, although the results look valid to a user, they are actually wrong." assertion.
- paragraph hasContent "As a baseline approach, a complete direct comparison is not possible except for 85 wrong datatypes and 13 failed regular expressions (cf. Table 8). However, even in this case it is not possible to provide a precision value, since RDFUnit runs through the whole set of resources and possibly catches errors the LD experts did not catch, because it considers the ontological schema. This is because the LD experts performed triple-based evaluation using the TripleCheckMate tool, which does not provide schema information directly. Thus, only those experts who are conversant with the schema might be able to identify those errors. Examples of such inconsistencies are datatype detections that are not defined in the ontology, e.g., dates vs. numbers (“1935”^^xsd:integer), or erroneous language tags. Also, rdfs:domain violations were not reported by the LD experts, since for every triple they would have had to cross-check the ontology definitions for the evaluated property and the rdf:type statements of the resource. Similar considerations apply to all the other pattern types described in Table 8. RDFUnit went beyond the isolated triple level that the LD experts and crowd were evaluating and checked various combinations of triples." assertion.
- paragraph hasContent "However, using RDFUnit, the set of incorrect values, datatypes, and literals can be extracted and then fed to the LD experts to verify those errors, as some of the errors – semantically incorrect triples – require human judgement. In the case of logical inconsistencies, RDFUnit relies only on domain (dataset) experts to define custom rules, such as the human height constraint." assertion.
- section-number hasContent "5.4.1" assertion.
- section-5.4.1-title hasContent "Object Values, Datatypes and Literals" assertion.
- paragraph hasContent "For this type of task, we implemented a baseline that retrieves, for each triple, the external web page – which corresponds to the object of the triple – and searches for occurrences of the foaf:name of the subject within the page. If the number of occurrences is greater than 1, the algorithm interprets the external page as being related to the resource. In this case the link is considered correct. Listing 1 shows the script used to detect the number of times the title of the resource appears in the web page." assertion.
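Listing 1 itself is not reproduced in this excerpt; a minimal sketch of the described check, assuming the page HTML has already been fetched and that `resource_name` holds the subject's foaf:name, could look like:

```python
import re

def link_is_correct(page_html, resource_name):
    """Count case-insensitive occurrences of the resource's name in the
    fetched page; per the text, the link is considered correct when the
    name occurs more than once."""
    occurrences = len(re.findall(re.escape(resource_name), page_html, re.IGNORECASE))
    return occurrences > 1
```

`re.escape` guards against names containing regex metacharacters (e.g. parentheses in disambiguated titles); the threshold of one occurrence mirrors the rule stated above.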
- table hasContent "Table 9" assertion.
- paragraph hasContent "In order to compare the baseline with the crowdsourcing approaches (i.e., detecting whether the interlinks are correct), we first extracted the interlinks from the triples subject to crowdsourcing. A total of 2,780 interlinks were retrieved. Table 9 shows the number and types of interlinks present in the dataset." assertion.
- paragraph hasContent "As a result of running this script, we found a total of 2,412 interlinks for which the title of the resource was not detected in the external web page (link). In other words, only 368 of the total 2,780 interlinks were detected to be correct by this automatic approach, which achieved a precision of 0.1323." assertion.
- section-number hasContent "5.4.2" assertion.
- section-5.4.2-title hasContent "Interlinks" assertion.
- section-number hasContent "5.4" assertion.
- section-5.4-title hasContent "Evaluation of Baseline Approaches" assertion.
- section-number hasContent "5" assertion.
- section-5-title hasContent "Evaluation" assertion.
- paragraph hasContent "Referring back to the research questions formulated in Section 1, our experiments let us identify the strengths and weaknesses of applying crowdsourcing mechanisms for data quality assessment, following the Find-Fix-Verify pattern. Regarding the precision achieved in both workflows, we compared the outcomes produced in each stage by the different crowds against a manually defined gold standard; the precision values achieved by both crowds show that crowdsourcing workflows offer feasible solutions to enhance the quality of Linked Data datasets (RQ1)." assertion.
- paragraph hasContent "In each type of task, the LD experts and MTurk workers applied different skills and strategies to solve the assignments successfully (RQ2). The data collected for each type of task suggests that the effort of LD experts must be applied to tasks demanding domain-specific skills beyond common knowledge. For instance, LD experts successfully identified issues with very specific datatypes, e.g., when time units are simply annotated as numbers ( xsd:Integer or xsd:Float ). In the same type of task, workers focused on assessing triples annotated with language tags, instead of datatypes like the experts. The MTurk crowd has proven to be very skilled at verifying whether literals are written in a certain language. In addition, workers were exceptionally good and efficient at performing comparisons between data entries, especially when some contextual information is provided." assertion.
- paragraph hasContent "Furthermore, we were able to detect common cases in which neither of the two forms of crowdsourcing we studied seems to be feasible. The most problematic task for the LD experts was discerning whether a web page is related to an RDF resource. Although the experimental data does not provide insights into this behavior, we are inclined to believe that this is due to the relatively higher effort required by this specific type of task, which involves checking an additional site outside the TripleCheckMate tool. Although the crowd outperformed the experts in finding incorrect ‘interlinks’, the MTurk crowd is not sufficiently capable of assessing links that correspond to RDF resources. Furthermore, MTurk workers did not perform well on tasks about datatypes, where they recurrently confused numerical datatypes with time units." assertion.
- paragraph hasContent "The observed results suggest that LD experts and crowd workers offer complementary strengths that can be exploited not only in different assessment iterations or stages (RQ3) but also in particular subspaces of quality issues. LD experts exhibited good performance when finding incorrect object values and datatypes (in particular, numerical datatypes). In turn, microtask crowdsourcing can be effectively applied to: i) verify whether object values are incorrect, ii) verify literals annotated with language tags, and iii) find and verify incorrect links from RDF resources to web pages." assertion.
- paragraph hasContent "One of the goals of our work is to investigate how the contributions of crowdsourcing approaches can be integrated into LD curation processes, by evaluating the performance of two crowdsourcing workflows in a cost-efficient way. In microtask settings, the first challenge is to reduce the number of tasks submitted to the crowd and the number of requested assignments (different answers), since both of these factors determine the overall cost of crowdsourcing projects. For the Find stage, Algorithm 1 generated 2,339 HITs to crowdsource 68,976 RDF triples, consistently with the property stated in Proposition 2. In our experiments, we approved a total of 2,294 task solutions in the Find stage and, considering the payment per HIT (US$ 0.06), the total cost of this evaluation was US$ 137.58. Furthermore, in the Verify stage, the cost of submitting the problematic triples found by the experts to MTurk was only US$ 43." assertion.
- section-number hasContent "6" assertion.
- section-6-title hasContent "Final Discussions" assertion.
- paragraph hasContent "Our work is situated at the intersection of the following research areas: crowdsourcing, Linked Data management, and Web data quality assessment." assertion.