ASRAEL ANR Project

Overview

The information and communication society has led to the production of huge volumes of content. This content is still largely unstructured (text, images, videos), and the promise of a "Web of Knowledge" is still far from being fulfilled. The situation is evolving with the development of Open Data portals and resources such as DBpedia, which have made it easier to access information stored in databases (economic or demographic statistics, world knowledge contained in Wikipedia infoboxes, etc.). However, most knowledge is still conveyed by textual data. Among the information that is hard to access because it is locked in text, information about events is of great interest, notably in the context of the emergence of data journalism. Until now, data journalism has been fed by publicly available statistical data, but it has paradoxically made little use of the quintessential journalistic material, namely events. The ASRAEL project aims to bridge this gap.

Our proposal falls within the general scientific framework of information extraction (IE). We aim to extract events from a large set of textual documents, without prior knowledge about them, and to populate and publish a knowledge base of events. This knowledge base will support a dedicated event search engine.

We define an event in the traditional information extraction sense. An event is a structured representation of something that happens, with a nucleus, a spatio-temporal context and some arguments. An "event type" groups comparable instances of events, such as "earthquake", "election" or "car race". Arguments are attribute/value pairs that characterize an event type (for an earthquake, its location, date, magnitude, casualties...). A template is the set of arguments that can describe an event type (earthquake template, election template). The generic representation of an event is based on the rule of the "5 Ws" (What, Who, Where, When, Why) that prevails in the "Anglo-Saxon" way of writing articles. This rule stipulates that a good description of an event must make these five elements explicit.
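As a minimal sketch of the representation just described, the following Python data classes model a template and an event instance. The class and field names (Template, Event, nucleus, arguments) and the sample values are illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Optional, Set


@dataclass(frozen=True)
class Template:
    """Set of arguments that can describe one event type."""
    event_type: str
    argument_names: FrozenSet[str]


@dataclass
class Event:
    """One event instance: nucleus, spatio-temporal context, arguments."""
    template: Template
    nucleus: str                     # the "What", e.g. a trigger word
    where: Optional[str] = None      # spatial context
    when: Optional[str] = None       # temporal context (ISO date string here)
    arguments: Dict[str, Any] = field(default_factory=dict)

    def unknown_arguments(self) -> Set[str]:
        """Return argument names that the template does not define."""
        return set(self.arguments) - self.template.argument_names


# Invented sample values, for illustration only.
earthquake = Template("earthquake", frozenset({"magnitude", "casualties"}))
event = Event(earthquake, nucleus="earthquake", where="Turkey",
              when="2016-01-01", arguments={"magnitude": 8.1})
```

Keeping the template separate from its instances mirrors the distinction made above between an event type and the events that belong to it.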

In automatic information extraction, the information about "Who", "Where" and "When" is extracted by a traditional and fairly generic named entity recognition approach. The "What", on the other hand, is very domain-specific. For this reason, traditional IE systems rely on templates predefined by experts and identify events in texts with either rule-based systems or statistical models. However, in the general domain, where the huge number of possible events makes the manual definition of these templates impossible, information retrieval ("bag of words") methods take over, but they do not provide a structured answer.

In this project, we aim to tackle the following challenges:

Automatically discover event templates from very large text corpora, and populate a knowledge base dedicated to events. This requires a mixture of supervised and unsupervised approaches, which becomes necessary as soon as one considers such a generic problem.
Use this knowledge base to build an event aggregator and a semantic search engine. With this engine, a user (either a journalist or an end user) will be able to query for an event type (e.g. earthquake) and provide filters on attribute values (location = Turkey, magnitude > 8, etc.). The knowledge base will also be published following the linked data principles for others to reuse.
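To make the intended kind of query concrete, here is a small in-memory sketch of "an event type plus filters on attribute values". The record layout, the search function and the sample events are all invented for illustration; the actual engine is described under work package 5.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# A filter is (attribute name, comparison function, reference value).
Filter = Tuple[str, Callable[[Any, Any], bool], Any]

# Invented sample records standing in for knowledge-base entries.
EVENTS: List[Dict[str, Any]] = [
    {"type": "earthquake", "location": "Turkey", "magnitude": 8.2},
    {"type": "earthquake", "location": "Turkey", "magnitude": 5.4},
    {"type": "earthquake", "location": "Chile", "magnitude": 8.5},
    {"type": "election", "location": "Turkey"},
]


def search(events: List[Dict[str, Any]], event_type: str,
           filters: List[Filter]) -> List[Dict[str, Any]]:
    """Return events of the given type that satisfy every filter."""
    results = []
    for ev in events:
        if ev.get("type") != event_type:
            continue
        # An event that lacks a filtered attribute does not match.
        if all(name in ev and op(ev[name], value)
               for name, op, value in filters):
            results.append(ev)
    return results


# location = Turkey, magnitude > 8
hits = search(EVENTS, "earthquake",
              [("location", operator.eq, "Turkey"),
               ("magnitude", operator.gt, 8)])
```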

Work packages

1. Coordination

2. Extraction of events and generic attributes

Tasks 2, 3 and 4 are intended to automatically discover the schemas (sets of attributes/values) corresponding to events and, in parallel, to build the instances of these schemas by annotating the documents, in preparation for the search engine of task 5. The first attributes explored will be the generic ones, in particular dates and locations (task 2), followed by the others (task 3). Task 4 describes how the knowledge base will be populated and how the documents will be annotated with the templates obtained.

The seeds of this schema discovery process are names representing types of events. These names will be extracted from the list of International Press Telecommunications Council (IPTC) categories, a very complete hierarchy that is already the entry point of the existing event search engine at AFP. This hierarchy contains theme names, many of which are event names (for example, "road accidents" is a subtype of "transport accidents", itself a subtype of "disasters and accidents"). All themes are codified and internationalized.
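The subtype chain quoted above can be sketched as a simple child-to-parent mapping. Only the three themes named in the text are included; the dictionary layout and the helper names (ancestors, is_subtype) are assumptions made for illustration and do not reflect the IPTC data format.

```python
from typing import Dict, List, Optional

# Child -> parent links; only the chain quoted in the text is included.
PARENT: Dict[str, Optional[str]] = {
    "disasters and accidents": None,
    "transport accidents": "disasters and accidents",
    "road accidents": "transport accidents",
}


def ancestors(theme: str) -> List[str]:
    """Return the themes above `theme`, from nearest to most general."""
    chain = []
    parent = PARENT.get(theme)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain


def is_subtype(theme: str, candidate: str) -> bool:
    """True if `theme` is a direct or indirect subtype of `candidate`."""
    return candidate in ancestors(theme)


road_chain = ancestors("road accidents")
```

A seed such as "road accidents" can then inherit whatever is known about its more general themes.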

3. Structure of the event base

Even though the overall event representation framework is already defined (kernel and arguments in the form of attribute/value pairs), a preliminary step will be to design the structure and content of the event database. This effort starts with a thorough reflection on how events are modeled, and in particular on the types of attributes, their granularity and their evolution over time. Events can be interrelated through causal relationships. They can also belong to series (e.g. the Olympic Games, the Grammy Awards). An ontology of media events will be created at the beginning of the project, taking into account the needs expressed by journalists and the exchanges between the AFP Medialab and the scientists of CEA LIST, LIMSI and EURECOM. In this area, the partners will be able to rely on the knowledge acquired by the AFP Medialab during the European project Glocal (search and indexing of events, XML modeling) and the French project SCRIBO (semantic web), as well as on that of EURECOM (EventMedia project and coordination of the dedicated schema.org community).

The definition of a structured event base also depends on how it is implemented, that is to say on the choice of a representational formalism. Since this base evolves as new event types and new events are discovered, a triplestore is a good fit. A flexible data model as defined by OWL will make it possible to add new attributes to the definition of event schemas, to run SPARQL queries (involving inference in some cases), to use the taxonomy of IPTC categories (Subject, Matter, Details) and to support multilingualism ("Putin", "Poutine"...). In the ontology model, an event type will be a class, and the attributes of that type's template will be of type ObjectProperty or DataProperty. Finally, authority lists (e.g. event categories) will be represented in SKOS to allow the same type of query. The search engine (Task 5) will use these attributes to build the index, with the attributes of event types serving as facets. The documents will be linked to the ontology through the extracted events and their constituents. We can also use the ontology to make inferences about indexed resources and exploit the results in full-text search: for example, the documents returned by such a search can be filtered or grouped according to event types or entities that are inferred rather than explicitly stated in them. We will use Virtuoso (for querying structured data in SPARQL), Elasticsearch and Solr (for full-text querying of data), as we did in the HyperTED project.
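As a rough illustration of the mapping from a template to an ontology (event type as a class, template attributes as properties), the sketch below emits Turtle text using only the Python standard library. The ex: namespace, the function name and the sample attributes are invented for the example; note that the RDF vocabulary of OWL names data-valued properties owl:DatatypeProperty.

```python
from typing import Dict

# The ex: namespace is a placeholder, not the project's real namespace.
PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/events#> .\n"
)


def template_to_turtle(event_type: str, attributes: Dict[str, str]) -> str:
    """Serialize one template as Turtle.

    `attributes` maps an attribute name to "object" (value is a resource)
    or "data" (value is a literal such as a number or a date).
    """
    lines = [PREFIXES, f"ex:{event_type} a owl:Class ."]
    for name, kind in sorted(attributes.items()):
        owl_type = ("owl:ObjectProperty" if kind == "object"
                    else "owl:DatatypeProperty")
        lines.append(
            f"ex:{name} a {owl_type} ; rdfs:domain ex:{event_type} ."
        )
    return "\n".join(lines)


turtle = template_to_turtle(
    "Earthquake", {"location": "object", "magnitude": "data"}
)
print(turtle)
```

Adding a newly discovered attribute then amounts to adding one more property declaration, which is the kind of flexibility the paragraph above attributes to a triplestore.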

In order to stay close to the AFP and IPTC standards, the processed documents will be exported in NewsML-G2 format. This will make it easy to associate event metadata with each document, and also to annotate the HTML content of the document with embedded semantic markup (RDFa or Microdata using the rNews and schema.org vocabularies). An OSGi content annotation chain will allow the different components of the project to be industrialized while retaining the flexibility to use other software. Moreover, we will keep a trace of the induction process that led, at each iteration, to the content of the database, including the probability values resulting from induction, so that the information can be filtered according to its degree of reliability (or confidence).
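The idea of keeping induction traces and filtering by reliability can be sketched as follows. The Fact record, its fields and the sample values are hypothetical; the text does not specify how traces or probabilities are actually stored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fact:
    """One attribute value in the base, with its induction trace."""
    event_id: str
    attribute: str
    value: str
    iteration: int       # induction iteration that produced the fact
    probability: float   # confidence score resulting from induction


def reliable(facts: List[Fact], threshold: float) -> List[Fact]:
    """Keep only facts whose probability reaches the threshold."""
    return [f for f in facts if f.probability >= threshold]


# Invented sample facts, for illustration only.
FACTS = [
    Fact("e1", "location", "Turkey", iteration=1, probability=0.93),
    Fact("e1", "magnitude", "8.2", iteration=2, probability=0.41),
    Fact("e2", "location", "Chile", iteration=2, probability=0.78),
]

kept = reliable(FACTS, threshold=0.75)
```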

4. Populating the event base

This task is the second logical step in building a structured event base. It covers two ways of populating such a base: on the one hand, an initial population from a large news corpus, linked to the discovery of the event types present in this corpus; on the other hand, the processing of new articles so that the base is fed continuously.

5. Search Engine

Task 5 aims to implement the search engine from the schemas and annotated documents produced by Tasks 3 and 4. As mentioned above, this search engine will be queried with structured queries (an event type and constraints on attribute values), whose structure depends on the type of event. The engine will be integrated with the existing tool at AFP, namely a web interface (AFP-4W) that calls the Lucene search engine via Solr. Since the goal of the project is to obtain a functional prototype that AFP journalists can test, it will be necessary to handle the daily flow of new documents. In addition, a version of this engine will be provided for corpora of web pages.
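Since the existing AFP tool goes through Solr, a structured query of this kind could plausibly be translated into Solr request parameters, as in the sketch below. The field names (event_type, location, magnitude) and the helper function are assumptions; only the q and fq parameters and the range syntax (a curly brace for an exclusive bound) are standard Solr.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlencode


def build_solr_params(
    event_type: str,
    equals: Optional[Dict[str, str]] = None,
    greater_than: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, str]]:
    """Translate an event query into Solr q/fq parameters.

    Field names are hypothetical; they depend on the index schema.
    """
    params = [("q", f'event_type:"{event_type}"')]
    for name, value in (equals or {}).items():
        params.append(("fq", f'{name}:"{value}"'))
    for name, bound in (greater_than or {}).items():
        # "{bound TO *]" excludes the lower bound, i.e. strictly greater.
        params.append(("fq", f"{name}:{{{bound} TO *]"))
    return params


params = build_solr_params(
    "earthquake",
    equals={"location": "Turkey"},
    greater_than={"magnitude": 8},
)
# URL-encoded form, ready to append to a Solr select URL.
query_string = urlencode(params)
```

Using one fq parameter per constraint keeps each filter independent of the main query.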

6. Evaluation

The evaluation processes for each of the steps described above have been grouped into this task. Although each evaluation methodology has its own specificities, the same subset of event types will be selected and used throughout the project for the evaluation. These types will be chosen for their representativeness, but also for meeting as many as possible of the following criteria:

Existence of a prior expert study that determines the relevant attributes of the event type. This could come from systems that have already been developed or from a reference corpus (MUC, ACE, Giga Corpus, etc.) acquired by the partners.
Existence of a dedicated information retrieval system. For example, if a system dedicated to seismic events exists (and was not used during the development phase), this type of event will be a good candidate for the evaluation. It will thus be possible to compare the results of the project not only to an "ideal" reference, but also to a dedicated system.

In the literature, the acquisition of schemas is evaluated very imperfectly, using corpora designed for supervised information extraction. Since the induced schemas and roles are generally not named, part of the corpus is used in a supervised way to map the reference schemas to the induced schemas. As a result, the induction, this final mapping and the annotation of the documents are all evaluated at once. In our project, by contrast, we wish to evaluate the induced schemas and the annotation of the documents separately.
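For the document annotation side taken on its own, one common way to score a system against a reference is precision, recall and F1 over annotated attribute values. The sketch below assumes this metric and an annotation format of (document id, attribute, value) triples; neither is specified in the text, and the sample annotations are invented.

```python
from typing import Set, Tuple

# (document id, attribute, value); an assumed annotation format.
Annotation = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Set[Annotation], reference: Set[Annotation]
) -> Tuple[float, float, float]:
    """Score predicted annotations against reference annotations."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented sample annotations, for illustration only.
reference = {
    ("doc1", "location", "Turkey"),
    ("doc1", "magnitude", "8.2"),
    ("doc2", "location", "Chile"),
}
predicted = {
    ("doc1", "location", "Turkey"),
    ("doc2", "location", "Peru"),
}

p, r, f1 = precision_recall_f1(predicted, reference)
```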

Template Acquisition for Open Event Extraction

ANR Project 2016-2019

(ANR-15-CE23-0018 ‐ Acquisition de Schémas pour la Reconnaissance et l'Annotation d'Événements Liés)

LIMSI-CNRS

(Team ILES, @LIMSI_NLP)

CEA-LIST

(lab. LVIC)

AFP

(MediaLab)

EURECOM
