Matches in Nanopublications for { ?s <http://purl.org/spar/c4o/hasContent> ?o ?g. }
- paragraph hasContent "There is wide agreement in the community that specific aspects of Linked Data management are inher- ently human-driven [2]. This holds true most notably for those Linked Data tasks which require a substantial amount of domain knowledge or detailed, context-specific insight that go beyond the assumptions and natural limitations of algorithmic approaches." assertion.
- footnote hasContent "Footnote 21" assertion.
- paragraph hasContent "Like any Web-centric community of its kind, Linked Data has had its share of volunteer initiatives, including the Linked Open Data Cloud itself and DBpe- dia [23], and competitions such as the yearly Semantic Web Challenge 20 and the European Data Innovator Award. [Footnote 21]" assertion.
- paragraph hasContent "From a process point of view, [41] introduced a methodology for publishing Linked Data. They dis- cussed activities which theoretically could be subject to crowdsourcing, but did not discuss such aspects explicitly. Similarly, [25] tried to map ontology engineering methodologies to Linked Data practice, drawing on insights from interviews with practitioners and quantitative analysis. A more focused account of the use of human and crowd intelligence in Linked Data man- agement is offered in [36]. The authors investigated several technically oriented scenarios in order to identify lower-level tasks and analyze the extent to which they can be feasibly automated. In this context, feasibility referred primarily to the trade-off between the effort associated with the usage of a given tool targeting automation - including aspects such as getting familiar with the tool, but more importantly creating training data sets and examples, configuring the tool and validating (intermediary) results - and the quality of the outcomes. The fundamental question the work attempted to answer was related to ours, though not focused on quality assurance and repair – their aim was come up with patterns for human and machine-driven computation, which could service semantic data management scenarios effectively. This was also at the core of [35], which took the main findings of this analysis a step further and proposed a methodology to build incentivized Semantic Web applications, including guidelines for mechanism design which are compatible to our fix-find-verify workflow. They have also analyzed motivators and incentives for several types of Semantic Web tasks, from ontology population to se- mantic annotation." assertion.
- paragraph hasContent "An important prerequisite to any participatory exercise is the ability of the crowd – experts of laymen – to engage with the given data management tasks. This has been subject to several user experience design studies [26,30,34,40,39], which informed the implementation of our crowdsourcing projects, both the contest, and the paid microtasks running on Mechanical Turk. For instance, microtasks have been used for entity link- ing [8] quality assurance, resource management [42] and ontology alignment [33]." assertion.
- paragraph hasContent "At a more technical level, many Linked Data management tasks have already been subject to human computation, be that in the form of games with a purpose [27,38] or, closer to our work, paid microtasks. Games with a purpose, which capitalize on entertain- ment, intellectual challenge, competition, and reputation, offer another mechanism to engage with a broad user base. In the field of semantic technologies, the OntoGame series proposes several games that deal with the task of data interlinking, be that in its ontology alignment instance (SpotTheLink [38]) or multimedia interlinking (SeaFish [37]). Similar ideas are imple- mented in GuessWhat?!, a selection-agreement game which uses URIs from DBpedia, Freebase and OpenCyc as input to the interlinking process [27]. While OntoGame looks into game mechanics and game narratives and their applicability to finding similar enti- ties and other types of correspondences, our research studies an alternative crowdsourcing strategy that is based on financial rewards in a microtask platform. Most relevant for our work are the experiments comparing games with a purpose and paid microtasks, which showed the complementarity of the two forms of crowdsourcing [32,9]." assertion.
- paragraph hasContent "A similar study is discussed in [28] for ontology alignment. McCann and colleagues studied mo- tivators and incentives in ontology alignment. They investigated a combination of volunteer and paid user involvement to validate automatically generated alignments formulated as natural-language questions. While this proposal shares many commonalities with the CrowdMap [33] approach, the evaluation of their solution is based on a much more constrained experiment that did not rely on a real-world labor market- place and associated work force." assertion.
- paragraph hasContent "Web data quality assessment. Existing frameworks for the quality assessment of the Web of Data can be broadly classified as automated (e.g. [15]), semiautomated (e.g. [11]) and manual (e.g.[4,29]). In particular, for the quality issues used in our experiments, [15] performs quality assessment on links but it fully automated and thus is limited as it does allow the user to choose the input dataset. Also, the incor- rect interlinks detected require human verification as they do not take the semantics into account. On the other hand, for detection of incorrect object values and datatypes and literals, the SWIQA framework [14] can be used by utilizing different outlier and clustering techniques. However, it lacks specific syntactical rules to detect all of the errors and requires knowledge of the underlying schema for the user to specify these rules. Other researchers analyzed the quality of Web [5] and RDF [17] data. The second study focuses on errors occurred during the publication of Linked Data sets. Recently, a study [18] looked into four million RD- F/XML documents to analyze Linked Data conformance. These studies performed large-scale quality assessment on LD but are often limited in their ability to produce interpretable results, demand user expertise or are bound to a given data set." assertion.
- paragraph hasContent "SPARQL Inferencing Notation (SPIN) 22 is a W3C submission aiming at representing rules and constraints on Semantic Web models using SPARQL. The approach described in [13] advocates the use of SPARQL and SPIN for RDF data quality assessment. In a similar way, Fürber et al. [12] define a set of generic SPARQL queries to identify missing or illegal literal values and datatypes and functional dependency violations. Another related approach is the Pellet Integrity Constraint Validator ICV 23 . Pellet ICV translates OWL integrity constraints into SPARQL queries. A more lightweight RDF constraint syntax, decoupled from SPARQL, is offered from Shape Expressions (ShEx) [31] and IBM Resource Shapes 24 ." assertion.
- section-number hasContent "7.1" assertion.
- section-7.1-title hasContent "Using Crowdsourcing in Linked Data Management" assertion.
- section-number hasContent "7" assertion.
- section-7-title hasContent "Related Work" assertion.
- paragraph hasContent "In this paper, we proposed and compared crowdsourcing mechanisms to evaluate the quality of Linked Data (LD); the study was conducted in particular on the DBpedia dataset. Two different types of crowds and mechanisms were investigated for the initial detec- tion of quality issues: object value, datatype and language tags, and interlinks. We secondly focused on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts and lay workers and leverage the results from the Find-only approaches." assertion.
- paragraph hasContent "For the first part of our study, the Find stage was implemented using a contest-based format to engage with a community of LD experts in discovering and classifying quality issues of DBpedia resources. Contributions obtained through the contest (referring to flawed object values, incorrect datatypes and missing links) and were submitted to Amazon Mechanical Turk (MTurk), where we asked workers to Verify them. For the second part, only microtask crowdsourcing was used to perform the Find and Verify stages on the same set of DBpedia resources used in the first part." assertion.
- paragraph hasContent "The evaluation of the results showed that it is feasible to crowdsource the detection of flaws in LD sets; in particular, the experiments revealed that (i) lay workers are in fact able to detect certain quality issues with satisfactory precision; that (ii) experts perform well in identifying triples with ‘object value’ or ‘datatype’ issues, and lastly, (iii) the two approaches reveal complementary strengths." assertion.
- paragraph hasContent "Our methodology is applicable to any LD set and can be easily expanded to cover different types of quality issues. Our findings could also inform the design of the DBpedia extraction tools and related community processes, which already make use of contributions from volunteers to define the underlying map- ping rules in different languages. Finally, as with any form of computing, our work will be most useful as part of a broader architecture, in which crowdsourcing is brought together with automatic quality assessment and repair components and integrated into existing data governance frameworks." assertion.
- paragraph hasContent "Future work will first focus on conducting new experiments to test the value of the crowd for further different types of quality problems as well as for different LD sets from other knowledge domains. In the longer term, we will also concern ourselves with the question of how to optimally integrate crowd contributions – by implementing the Fix stage – into curation processes and tools, in particular with respect to the trade-offs of costs and quality between manual and automatic ap- proaches. Another area of research is the integration of baseline approaches before the crowdsourcing step in order to filter out errors that can be detected automatically to further increase the productivity of both LD experts and crowd workers." assertion.
- section-number hasContent "8" assertion.
- section-8-title hasContent "Conclusions and Future Work" assertion.
- paragraph hasContent "This work was supported by grants from the European Union’s 7th Framework Programme provided for the projects GeoKnow (GA no. 318159), Aligned (GA no. 644055) and LOD2 (GA no. 257943)." assertion.
- abstract hasContent "The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for Intelligent Web-reading Agents: hypothetically, they would skim through disparate Web sources corpora and generate meaningful structured assertions to fuel Knowledge Bases (KBs). Ultimately, comprehensive KBs, like Wikidata and DBpedia, play a fundamental role to cope with the issue of information overload. On account of such vision, this paper depicts the F ACT E XTRACTOR , a complete Natural Language Processing (NLP) pipeline which reads an input textual corpus and produces machine-readable statements. Each statement is supplied with a confidence score and undergoes a disambiguation step via entity linking, thus allowing the assignment of KB-compliant URIs. The system implements four research contributions: it (1) executes N-ary relation extraction by applying the Frame Semantics linguistic theory, as opposed to binary techniques; it (2) jointly populates both the T-Box and the A-Box of the target KB; it (3) relies on a lightweight NLP machinery, namely part-of-speech tagging only; it (4) enables a completely supervised yet reasonably priced machine learning environment through a crowdsourcing strategy. We assess our approach by setting the target KB to DBpedia and by considering a use case of 52, 000 Italian Wikipedia soccer player articles. Out of those, we yield a dataset of more than 213, 000 triples with a 78.5% F 1 . We corroborate the evaluation via (i) a performance comparison with a baseline system, as well as (ii) an analysis of the T-Box and A-Box augmentation capabilities. The outcomes are incorporated into the Italian DBpedia chapter, can be queried through its SPARQL endpoint, and/or downloaded as standalone data dumps. The codebase is released as free software and is publicly available in the DBpedia Association repository." assertion.
- paragraph hasContent "The World Wide Web is nowadays one of the most prominent sources of information and knowledge. De- spite the constantly increasing availability of semi-structured or structured data, a major portion of its content is still represented in an unstructured form, namely free text: deciphering its meaning is a complex task for machines and yet relies on subjective human interpretations. Hence, there is an ever growing need for Intelligent Web-reading Agents, i.e., Artificial Intelligence systems that can read and understand human language in documents across the Web. Ideally, these agents should be robust enough to interchange between heterogeneous sources with agility, while maintaining equivalent reading capabilities. More specifically, given a set of input corpora (where an item corresponds to the textual content of a Web source), they should be able to navigate from corpus to corpus and to extract compa- rable structured assertions out of each one. Ultimately, the collected data would feed a target Knowledge Base (KB), namely a repository that encodes areas of human intelligence into a richly shaped representation. Typically, KBs are composed of graphs, where real-world and abstract entities are bound together through relationships, and classified according to a formal description of the world, i.e., an ontology." assertion.
- paragraph hasContent "In this scenario, the encyclopedia Wikipedia contains a huge amount of data, which may represent the best digital approximation of human knowledge. Recent efforts, most notably DB PEDIA [23], F REEBASE [8], YAGO [21], and W IKIDATA [31], attempt to extract semi-structured data from Wikipedia in order to build KBs that are proven useful for a variety of applications, such as question answering, entity summarization and entity linking (EL), just to name a few. The idea has not only attracted a continuously rising commitment of research communities, but has also become a substantial focus of the largest Web companies. As an anecdotal yet remarkable proof, Google acquired Freebase in 2010, 1 embedded it in its K NOWLEDGE G RAPH , 2 and has lately opted to shut it down to the public. 3 Currently, it is foreseen that Freebase data will eventually migrate to Wikidata 4 via the primary sources tool, 5 which aims at standardizing the flow for data donations." assertion.
- paragraph hasContent "However, the reliability of a general-purpose KB like Wikidata is an essential requirement to ensure credible (thus high-quality) content: as a support for their trustworthiness, data should be validated against third-party resources. Even though the Wikidata community strongly agrees on the concern, 6 few efforts have been approached towards this direction. The addition of references to external (i.e., non-Wikimedia), authoritative Web sources can be viewed as a form of validation. Consequently, such real-world setting further consolidates the need for an intelligent agent that harvests structured data from raw text and produces e.g., Wikidata statements with reference URLs. Besides the prospective impact on the KB augmentation and quality, the agent would also dramatically shift the burden of manual data addition and curation, by pushing the (intended) fully human-driven flow towards an assisted paradigm, where automatic suggestions of pre-packaged statements just require to be approved or rejected. Figure 1 depicts the current state of the primary sources tool interface for Wikidata editors, which is in active development yet illustrates such future technological directions. Our system already takes part in the process, as it feeds the tool back-end." assertion.
- paragraph hasContent "On the other hand, the DBpedia E XTRACTION F RAMEWORK 7 is pretty much mature when dealing with Wikipedia semi-structured content like infoboxes, links and categories. Nevertheless, unstructured content (typically text) plays the most crucial role, due to the potential amount of extra knowledge it can deliver: to the best of our understanding, no efforts have been carried out to integrate an unstructured data extractor into the framework. For instance, given the Germany football team article, 8 we aim at extracting a set of meaningful facts and structure them in machine-readable statements. The sentence In Euro 1992, Germany reached the final, but lost 0–2 to Denmark would produce a list of triples, such as: (Germany, defeat, Denmark) (defeat, score, 0–2) (defeat, winner, Denmark) (defeat, competition, Euro 1992)" assertion.
- paragraph hasContent "To fulfill both Wikidata and DBpedia duties, we aim at investigating in what extent can the Frame Semantics theory [16,17] be leveraged to perform Information Extraction over Web documents. The main purpose of Information Extraction is to gather structured data from free text via Natural Language Processing (NLP), while Frame Semantics originates from linguistic research in Artificial Intelligence. A frame can be informally defined as an event triggered by some term in a text and embedding a set of participants, or Frame Elements (FEs). Hence, the aforementioned sentence would induce the DEFEAT frame (triggered by lost) together with the WINNER, COMPETITION, and SCORE participants. Such theory has led to the creation of FRAME NET [5,6], namely a lexical database with manually annotated examples of frame usage in English. FrameNet currently adheres to a rigorous protocol for data annotation and quality control. The activity is known to be expensive with respect to time and cost, thus constituting an encumbrance for the extension of the resource [4], both in terms of additional labeled sentences and of languages. To alleviate this, crowdsourcing the annotation task is proven to dramatically reduce the financial and temporal expenses. Consequently, we foresee to exploit the novel annotation approach described in [18], which provides full frame annotation in a single step and in a bottom-up fashion, thus being also more compliant with the definition of frames as per [17]." assertion.
- paragraph hasContent "In this paper, we focus on Wikipedia as the source corpus and on DBpedia as the target KB. We propose to apply NLP techniques to Wikipedia text in order to harvest structured facts that can be used to automatically add novel statements to DBpedia. Our FACT EXTRACTOR is set apart from related state of the art thanks to the combination of the following contributions:" assertion.
- paragraph hasContent "The remainder of this paper is structured as follows: Section 2 defines the specific problem we aim at tackling and illustrates the proposed solution. We introduce a use case in Section 3, which will drive the implementation of our system. Its high-level architecture is then described in Section 4, and devises the core modules, which we detail in Section 5, 6, 7, 8, and 9. A base- line system is reported in Section 10: this enables the comparative evaluation presented in Section 11, among with an assessment of the T-Box and A-Box enrichment capabilities. In Section 12, we gather a list of research and technical considerations to pave the way for future work. The state of the art is reviewed in Section 13, before our conclusions are drawn in Section 14.”" assertion.
- section-introduction-title hasContent "Introduction" assertion.
- paragraph hasContent "The main research challenge is formulated as a KB population problem: specifically, we tackle how to au- tomatically enrich DBpedia resources with novel state- ments extracted from the text of Wikipedia articles. We conceive the solution as a machine learning task implementing the Frame Semantics linguistic theory [16,17]: we investigate how to recognize meaningful factual parts given a natural language sentence as input. We cast this as a classification activity falling into the su- pervised learning paradigm. Specifically, we focus on the construction of a new extractor, to be integrated into the current DBpedia infrastructure. Frame Semantics will enable the discovery of relations that hold between entities in raw text. Its implementation takes as input a collection of documents from Wikipedia (i.e., the corpus) and outputs a structured dataset composed of machine-readable statements." assertion.
- section-2-title hasContent "Problem and Solution" assertion.
- paragraph hasContent "Soccer is a widely attested domain in Wikipedia: according to DBpedia, 9 the English Wikipedia counts a total of 223, 050 articles describing soccer-related entities, which is a significant portion (around 5%) of the whole chapter. Moreover, infoboxes on those articles are generally very rich (cf. for instance the Germany national football team article). On account of these ob- servations, the soccer domain properly fits the main challenge of this effort. Table 1 displays three examples of candidate statements from the Germany national football team article text, which do not exist in the corresponding DBpedia resource." assertion.
- section-3-title hasContent "Use Case" assertion.
- paragraph hasContent "The implementation workflow is intended as follows, depicted in Figure 2, and applied to the use case in Italian language:" assertion.
- paragraph hasContent "1. Corpus Analysis" assertion.
- paragraph hasContent "2. Supervised Fact Extraction" assertion.
- paragraph hasContent "3. Dataset Production: structuring the extraction results to fit the target KB (i.e., DBpedia) data model (i.e., RDF). A frame would map to a property, while participants would either map to subjects or to objects, depending on their role." assertion.
- section-4-title hasContent "System Description" assertion.
- paragraph hasContent "Wikipedia dumps 10 are packaged as XML documents and contain text formatted according to the Mediawiki markup syntax, 11 with templates to be transcluded. 12 Hence, a pre-processing step is required to obtain a raw text representation of the dump. To achieve this, we leverage the WIKI EXTRACTOR , 13 a third-party tool that retains the text and expands templates of a Wikipedia XML dump, while discarding other data such as tables, references, images, etc. We note that the tool is not completely robust with respect to templates expansion. Such drawback is expected for two reasons: first, new templates are constantly defined, thus requiring regular maintenance of the tool; second, Wikipedia editors do not always comply to the specifications of the templates they include. Therefore, we could not obtain a fully cleaned Wikipedia plain text corpus, and noticed gaps in its content, probably due to template expansion failures. Nevertheless, we argue that the loss of information is not significant and can be neglected despite the recall cost. From the entire Italian Wikipedia corpus, we slice the use case subset by querying the I TALIAN DBPEDIA CHAPTER 14 for the Wikipedia article IDs of relevant entities." assertion.
- paragraph hasContent "Given the use case corpus, we first extract the complete set of verbs through a standard NLP pipeline: to- kenization, lemmatization and POS tagging. POS in- formation is required to identify verbs, while lemmas are needed to build the ranking. TREE TAGGER 15 is exploited to fulfill these tasks. Although our input has a relatively low dimension (i.e., 7.25 million tokens circa), we observe that the tool is not able to handle it as a whole, since it crashes with a segmentation fault even on a powerful machine (i.e., 24 cores CPU at 2.53 GHz, 64 GB RAM). Consequently, we had to run it over each document, thus impacting on the processing time. However, we believe that further investigation will lead to the optimization of such issue." assertion.
- section-5.1-title hasContent "Lexical Units Extraction" assertion.
- paragraph hasContent "The unordered set of extracted verbs is the subject of a further analysis, which aims at discovering the most representative verbs with respect to the corpus. Two measures are combined to generate a score for each verb lemma, thus enabling the creation of a rank. We first compute the term frequency–inverse document frequency (TF-IDF) of each verb lexicalization (i.e., the occurring tokens) over each document in the corpus: this weighting measure is intended to capture the lexi- cographical relevance of a given verb, namely how important it is with respect to other terms in the whole corpus. Then, we determine the standard deviation value out of the TF-IDF scores list: this statistical measure is meant to catch heterogeneously distributed verbs, in the sense that the higher the standard deviation is, the more variably the verb is used, thus helping to understand its overall usage signal over the corpus. Ultimately, we produce the final score and assign it to a verb lemma by averaging all its lexicalizations scores. The top-N lemmas serve as candidate LUs, each evoking one or more frames according to the definitions of a given frame repository." assertion.
- section-5.2-title hasContent "Lexical Units Ranking" assertion.
- section-5-title hasContent "Corpus Analysis" assertion.
- paragraph hasContent "Among the top 50 LUs that emerged from the corpus analysis phase, we manually selected a subset of 5 items to facilitate the full implementation of our pipeline. Once the approach has been tested and evaluated, it can scale up to the whole ranking (cf. Section 12 for more observations). The selected LUs comply to two criteria: first, they are picked from both the best and the worst ranked ones, with the purpose of assessing the validity of the corpus analysis as a whole; second, they fit the use case domain, instead of being generic. Consequently, we proceed with the following LUs: esordire (to start out), giocare (to play), perdere (to lose), rimanere (to stay, remain), and vincere (to win)." assertion.
- paragraph hasContent "The next step consists of finding a language resource (i.e., frame repository) to suitably represent the use case domain. Given a resource, we first need to define a relevant subset, then verify that both its frame and FEs definitions are a relevant fit. After an investigation of FrameNet and K ICKTIONARY [29], we notice that:" assertion.
- paragraph hasContent "Therefore, we adopted a custom frame repository, max- imizing the reuse of the available ones as much as possible, thus serving as a hybrid between FrameNet and Kicktionary. Moreover, we tried to provide a challenging model for the classification task, prioritizing FEs overlap among frames and LU ambiguity (i.e., focusing on very fine-grained semantics with subtle sense differences). We believe this does not only apply to machines, but also to humans: we can view it as a stress test both for the machine learning and the crowdsourcing parts. A total of 6 frames and 15 FEs are modeled with Italian labels as follows:" assertion.
- section-6-title hasContent "Use Case Frame Repository" assertion.
- paragraph hasContent "The first step involves the creation of the training set: we leverage the crowdsourcing platform CROWDFLOWER 16 and the method described in [18], which requires users to detect the core FEs: these are the fundamental items to distinguish between frames, as opposed to extra ones, thus allowing to automatically induce the correct frame. The training set has a double outcome, as it will feed two classifiers: one will identify FEs, and the other is responsible for frames." assertion.
- paragraph hasContent "Both frame and FEs recognition are cast to a multi- class classification task: while the former can be related to text categorization, the latter should answer questions such as can this entity be this FE? or is this entity this FE in this context?. Such activity boils down to semantic role labeling (cf. [24] for an introduction), and usually requires a more fine-grained text analysis. Previous work in the area exploits deeper NLP layers, such as syntactic parsing (e.g., [25]). We alleviate this through EL techniques, which perform word sense dis- ambiguation by linking relevant parts of a source sentence to URIs of a target KB. We leverage THE WIKI MACHINE 17 [19], a state-of-the-art [26] approach conceived for connecting text to Wikipedia URLs, thus inherently entailing DBpedia URIs. EL results are part of the FE classifier feature set. We claim that EL enables the automatic addition of features based on existing entity attributes within the target KB (notably, the class of an entity, which represents its semantic type)." assertion.
- paragraph hasContent "Given as input an unknown sentence, the full frame classification workflow involves the following tasks: tokenization, POS tagging, EL, FE classification, and frame classification." assertion.
- paragraph hasContent "The seed selection procedure allows to harvest meaningful sentences from the input corpus, and to feed the classifier. Therefore, its outcome is two-fold: to build a representative training set and to extract relevant sentences for classification. We experimented multiple strategies as follows. They all share the same base constraint, i.e., each seed must contain a LU lexicalization." assertion.
- paragraph hasContent "First, we note that all the strategies but the baseline necessitate a significant cost overhead in terms of lan- guage resources availability and engineering. Further-more, given the soccer use case input corpus of 52, 000 articles circa, those strategies dramatically reduce the number of seeds, while the baseline performed an extraction with a .95 article/seed ratio (despite some noise). Hence, we decided to leverage the baseline for the sake of simplicity and for the compliance to our contribution claims. We still foresee further investigation of the other strategies for scaling besides the use case." assertion.
- section-7.1-title hasContent "Seed Selection" assertion.
- paragraph hasContent "We apply the one-step, bottom-up approach described in [18] to let the crowd perform a full frame annotation over a set of training sentences. The set is randomly sampled from the input corpus and contains 3, 055 items. The outcome is the same amount of frame examples and 55, 385 FE examples. The task is sent to the CrowdFlower platform." assertion.
- paragraph hasContent "We ask the crowd to (a) read the given sentence, (b) focus on the “topic” (i.e., the frame) written above it, and (c) assign the correct “label” (i.e., the FE) to each “word” (i.e., unigram) or “group of words” (i.e., n-grams) from the multiple choices provided below each n-gram. Figure 3 displays the front-end interface of a sample sentence." assertion.
- paragraph hasContent "During the preparation phase of the task input data, the main challenge is to automatically provide the crowd with relevant candidate FE text chunks, while minimizing the production of noisy ones. To tackle this, we experimented with the following chunking strategies:" assertion.
- paragraph hasContent "We surprisingly observed that the full-stack pipeline outputs a significant amount of noisy chunks, besides being the slowest strategy. On the other hand, the cus- tom chunker was the fastest one, but still too noisy to be crowdsourced. EL resulted in the best trade-off, and we adopted it for the final task. Obviously, noise cannot be automatically eliminated: we cast such validation to the crowd by allowing the None answer along with the candidate FE labels." assertion.
- paragraph hasContent "The task parameters are as follows:" assertion.
- paragraph hasContent "The outcomes are resumed in Table 2." assertion.
- paragraph hasContent "Finally, the crowdsourced annotation results are processed and translated into a suitable format to serve as input training data for the classifier." assertion.
- section-7.2.1-title hasContent "Task Anatomy" assertion.
- section-7.2-title hasContent "Training Set Creation" assertion.
- paragraph hasContent "We train our classifiers with the following linguistic features, in the form of bag-of-features vectors:" assertion.
- section-7.3-title hasContent "Frame Classification: Features" assertion.
- section-7-title hasContent "Supervised Fact Extraction" assertion.
- paragraph hasContent "During our pilot crowdsourcing annotation experiments, we noticed a low agreement on numerical FEs. Moreover, asking the crowd to label such frequently occurring FEs would represent a considerable overhead, resulting in a higher temporal cost (i.e., more annotations per sentence) and lower overall annotation accuracy. Hence, we opted for the implementation of a rule-based system to detect and normalize numerical expressions. The normalization process takes as input a numerical expression such as a date, a duration, or a score, and outputs a transformation into a standard format suitable for later inclusion into the target KB." assertion.
- paragraph hasContent "The task is not formulated as a classification one, but we argue it is relevant for the completeness of the extracted facts: rather, it is carried out via matching and transformation rule pairs. Given for instance the input expression tra il 1920 e il 1925 (between 1920 and 1925), our normalizer first matches it through a regular expression rule, then applies a transformation rule complying to the XML Schema Datatypes 20 (typically dates and times) standard, and finally produces the following output: 21" assertion.
- paragraph hasContent "All rule pairs are defined with the programming language-agnostic YAML 22 syntax. The pair for the above example is as follows:" assertion.
- section-8-title hasContent "Numerical Expressions Normalization" assertion.
- paragraph hasContent "The integration of the extraction results into DBpedia requires their conversion to a suitable data model, i.e., RDF. Frames intrinsically bear N-ary relations through FEs, while RDF naturally represents binary relations. Hence, we need a method to express FEs relations in RDF, which we call reification. This can be achieved in multiple ways:" assertion.
- paragraph hasContent "A recent overview [20] highlighted that all the mentioned strategies are similar with respect to query performance. Given as input n frames and m FEs, we argue that:" assertion.
- paragraph hasContent "We opted for the less verbose strategy, namely N-ary relations. Given the running example sentence In Euro 1992, Germany reached the final, but lost 0–2 to Denmark, classified as a DEFEAT frame and embedding the FEs WINNER, LOSER, COMPETITION, SCORE, we generate RDF as per the following Turtle serialization:" assertion.
- paragraph hasContent "We add an extra instance type triple to assign an ontology class to the reified frame, as well as a provenance triple to indicate the original sentence:" assertion.
- paragraph hasContent "It is not trivial to decide on the subject of the main frame statement, since not all frames are meant to have exactly one core FE that would serve as a plausible logical subject candidate: most have many, e.g., FINISH_COMPETITION has COMPETITION, COMPETITOR and OPPONENT as core FEs in FrameNet. Therefore, we tackle this as per the following assumption: given the encyclopedic nature of our input corpus, both the logical and the topical subjects correspond in each document. Hence, each candidate sentence inherits the document subject. We acknowledge that such assumption strongly depends on the corpus: it applies to entity-centric documents, but will not perform well for general-purpose ones such as news articles. However, we believe it is still a valid in-scope solution fitting our scenario." assertion.
- paragraph hasContent "Besides the fact datasets, we also keep track of confidence scores and generate additional datasets accord- ingly. Therefore, it is possible to filter facts that are not considered as confident by setting a suitable threshold. When processing a sentence, our pipeline outputs two different scores for each FE, stemming from the entity linker and the supervised classifier. We merge both signals by calculating the F-score between them, as if they were representing precision and recall, in a fashion similar to the standard classification metrics. The final score can be then produced via an aggregation of the single FE scores in multiple ways, namely: (a) arithmetic mean; (b) weighted mean based on core FEs (i.e., they have a higher weight than extra ones); (c) harmonic mean, weighted on core FEs as well." assertion.
- section-9.1-title hasContent "Confidence Scores" assertion.
- section-9-title hasContent "Dataset Production" assertion.
- paragraph hasContent "To enable a performance evaluation comparison with the supervised method, we developed a rule-based algorithm that handles the full frame and FEs annotation. The main intuition is to map FEs defined in the frame repository to ontology classes of the target KB: such mapping serves as a set of rule pairs (FE, class), e.g., (WINNER , SoccerClub). In the FrameNet terminology, this is homologous to the assignment of semantic types to FEs: for instance, in the ACTIVITY frame, the AGENT is typed with the generic class Sentient. The idea would allow the implementation of the bottom-up one-step annotation flow described in [18]: to achieve this, we run EL over the input sentences and check whether the attached ontology class metadata appear in the frame repository, thus fulfilling the FE classification task." assertion.
- paragraph hasContent "Besides that, we exploit the notion of core FEs: this would cater for the frame disambiguation part. Since a frame may contain at least one core FE, we proceed with a relaxed assignment, namely we set the frame if a given input sentence contains at least one entity whose ontology class maps to a core FE of that frame. The implementation workflow is illustrated in Algorithm 1: it takes as input the set S of sentences, the frame repository F embedding frame and FEs labels, core/non-core annotations and rule pairs, and the set L of trigger LU tokens." assertion.
- paragraph hasContent "It is expected that the relaxed assignment strategy will not handle the overlap of FEs across competing frames that are evoked by a single LU. Therefore, if at least one core FE is detected in multiple frames, the baseline makes a random assignment for the frame. Furthermore, the method is not able to perform FE classi- fication in case different FEs share the ontology class (e.g., both WINNER and LOSER map to SoccerClub): we opt for a FE random guess as well." assertion.
- section-10-title hasContent "Baseline Classifier" assertion.
- paragraph hasContent "We assess our main research contributions through the analysis of the following aspects:" assertion.
- paragraph hasContent "Table 3 describes the overall performance of the baseline and the supervised system over a gold standard dataset. We randomly sampled 500 sentences containing at least one occurrence of our use case LU set from the input corpus. We first outsourced the annotation to the crowd as per the training set construction and the results were further manually validated twice by the authors. Measures are computed as follows: (1) a true positive is triggered if the predicted label is correct and the predicted text chunk at least partially matches the expected one; (2) chunks that should not be labeled are marked with a “O” and not explicitly counted as true or false positives." assertion.
- paragraph hasContent "Figure 4 and Figure 5 respectively display the FE and frame classification confusion matrices. First, since the “O” markers were discarded from the evaluation, all the respective performance measures amount to 0." assertion.
- paragraph hasContent "Concerning FEs, we observe that COMPETIZIONE is frequently mistaken for PREMIO and ENTITÀ , while rarely for TEMPO and DURATA , or just missed. On the other hand, TEMPO is mistaken for COMPETIZIONE : our hypothesis is that competition mentions, such as World Cup 2014, are disambiguated as a whole entity by the linker, since a specific target Wikipedia article exists. However, it overlaps with a temporal expression, thus confusing the classifier. AGENTE is often mistaken for ENTITÀ , due to their equivalent semantic type, which is always a person. Finally, we notice that some FEs (i.e., SQUADRA_1, SQUADRA_2 and CLASSIFICA) did not appear in the training set, but did in the gold standard, and viceversa (i.e., ENTITÀ )." assertion.
- paragraph hasContent "With respect to frames, we note that ATTIVITÀ is often mistaken for STATO or not classified at all: in fact, the difference between these two frames is quite subtle with respect to their sense. The former is more generic and could also be labeled as CAREER : if we viewed it in a frame hierarchy, it would serve as a superframe of the latter. The latter instead encodes the development modality of a soccer player’s career, e.g., when he remains unbound from some team due to contracting issues. Hence, we may conclude that distin- guishing between these frames is a challenge even for humans. Furthermore, frames with no FEs are classified as “O”, thus considered wrong despite the correct prediction. VITTORIA is almost never mistaken for TROFEO : this is positively surprising, since the FE COMPETIZIONE (frame VITTORIA) is often mistaken for PREMIO (frame TROFEO), but those FEs do not seem to affect the frame classification. Again, such FE distinction must take into account a delicate sense nuance, which is hard for humans as well. Due to an error in the training set crowdsourcing step, we lack of VITTORIA and PARTITA samples." assertion.
- paragraph hasContent "Figure 6 and Figure 7 respectively plot the FE and frame classification performance, broken down to each label." assertion.
- section-11.1.1-title hasContent "Supervised Classification Performance Breakdown" assertion.
- section-11.1-title hasContent "Classification Performance" assertion.
- paragraph hasContent "One of our main objectives is to extend the target KB ontology with new properties on existing classes. We focus on the use case and argue that our approach will have a significant impact if we manage to identify non-existing properties. This would serve as a proof of concept which can ideally scale up to all kinds of input. In order to assess such potential impact in discovering new relations, we need to address the following question: which extractable relations are not already mapped in DBPO or do not even exist in the raw infobox properties datasets?. Table 4 illustrates an empirical lexicographical study gathered from the Italian Wikipedia soccer player subcorpus (circa 52,000 articles). It contains absolute occurrence frequencies of word stems (in descending order) that are likely to trigger domain-relevant frames, thus providing a rough overview of the extraction potential." assertion.
- paragraph hasContent "The corpus analysis phase (cf. Section 5) yielded a ranking of LUs evoking the frames ACTIVITY, DEFEA, MATCH, TROPHY, STATUS, and VICTORY: these frames would serve as ontology property candidates, together with their embedded FEs. DBPO already has most of the classes that are needed to represent the main entities involved in the use case: SoccerPlayer, SoccerClub, SoccerManager, SoccerLeague, SoccerTournament, SoccerClubSeason, SoccerLeagueSeason, although some of them lack an exhaustive description (cf. SoccerClubSeason 26 and SoccerLeagueSeason). 27" assertion.
- paragraph hasContent "For each of the aforementioned classes, we computed the amount and frequency of ontology and raw infobox properties via a query, with results in ascending order of frequency: Figure 8 illustrates their distribution. The horizontal axis stands for the normalized (log scale) frequency, encoding the current usage of properties in the target KB; the vertical axis represents the ratio (which we call coverage) between the position of the property in the ordered result set of the query and the total amount of distinct properties (i.e., the size of the result set). Properties with a null frequency are ignored." assertion.
- paragraph hasContent "First, we observe a lack of ontology property usage in 4 out of 7 classes, probably due to missing mappings between Wikipedia template attributes and DBPO. On the other hand, the ontology properties have a more homogenous distribution compared to the raw ones: this serves as an expected proof of concept, since the main purpose of DBPO and the ontology mappings is to merge heterogenous and multilingual Wikipedia template attributes into a unique representation. In average, most raw properties are concentrated below coverage and frequency threshold values of 0.8 and 4 respectively: this means that roughly 80% of them have a significantly low usage, further highlighted by the log scale. While ontology properties are better distributed, most still do not reach a high coverage/frequency tradeoff, except for SoccerPlayer, which benefits from both rich data (cf. Section 3) and mappings. 28" assertion.
- paragraph hasContent "On the light of the two analyses discussed above, it is clear that our approach would result in a larger variety and finer granularity of facts than those encoded into Wikipedia infoboxes and DBPO classes. Moreover, we believe the lack of dependence on infoboxes would enable more flexibility for future generalization to sources beyond Wikipedia." assertion.
- paragraph hasContent "Subsequent to the use case implementation, we manually identified the following mappings from frames and FEs to DBPO properties:" assertion.
- paragraph hasContent "In conclusion, we claim that 3 out 6 frames and 12 out of 15 FEs represent novel T-Box properties." assertion.
- section-11.2-title hasContent "T-Box Enrichment" assertion.
- paragraph hasContent "Our methodology enables a joint T-Box and A-Box augmentation: while frames and FEs serve as T-Box properties, the extracted facts feed the A-Box part. Out of 49, 063 input sentences, we generated a total of 213, 479 and 216, 451 triples (i.e., with a 4.35 and 4.41 ratio per sentence) from the supervised and the baseline classifiers respectively. 52% and 55% circa are considered confident, namely facts with confidence scores (cf. Section 9.1) above the dataset average threshold." assertion.