Template Acquisition for Open Event Extraction

ANR Project 2016-2019

(ANR-15-CE23-0018, Acquisition de Schémas pour la Reconnaissance et l'Annotation d'Événements Liés)


(lab. LVIC)

Main contact: Xavier Tannier (Univ. Paris-Sud, LIMSI-CNRS)


The information and communication society has led to the production of huge volumes of content. This content is still mostly unstructured (text, images, videos), and the promise of a "Web of Knowledge" is still far from fulfilled. The situation is evolving with the development of Open Data portals and resources such as DBpedia, which have made it easier to access information stored in databases (economic or demographic statistics, world knowledge contained in Wikipedia infoboxes, etc.). However, most knowledge is still conveyed in textual form. Among the information that is hard to access because it is locked in text, information about events is of great interest, notably in the context of the emergence of data journalism. Until now, data journalism has been fed by publicly available statistical data, but paradoxically it has made little use of the core journalistic material: events. The ASRAEL project aims at bridging this gap.

Our proposal falls within the general scientific framework of information extraction (IE). We aim at extracting events from a large set of textual documents, without prior knowledge about them, and at populating and publishing a knowledge base of events. This knowledge base will support a dedicated event search engine.

We define an event in the traditional information extraction sense. An event is a structured representation of something that happens, with a nucleus, a spatio-temporal context and some arguments. The "event type" groups comparable instances of events, such as "earthquake", "election" or "car race". Arguments are attribute/value pairs that characterize an event type (for an earthquake: its location, date, magnitude, casualties...). A template is the set of arguments that can describe an event type (earthquake template, election template). The generic representation of an event is based on the rule of the "5 Ws" (What, Who, Where, When, Why) that prevails in the "Anglo-Saxon" way of writing articles. This rule stipulates that a good description of an event must make these five elements explicit.
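The definitions above can be sketched as a small data structure. This is only an illustrative sketch: the class names `Template` and `Event`, their field names, the `conforms_to` helper and the sample values are all assumptions made for this example, not the representation used by the project.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Set


@dataclass
class Template:
    """The set of argument names that can describe one event type (hypothetical)."""
    event_type: str
    argument_names: Set[str]


@dataclass
class Event:
    """One event instance: a nucleus, a spatio-temporal context and arguments."""
    event_type: str                 # "What", e.g. "earthquake"
    nucleus: str                    # word or phrase that expresses the event
    location: Optional[str] = None  # "Where"
    date: Optional[str] = None      # "When"
    arguments: Dict[str, Any] = field(default_factory=dict)  # attribute/value pairs

    def conforms_to(self, template: Template) -> bool:
        """True if the event has the template's type and uses only its arguments."""
        return (self.event_type == template.event_type
                and set(self.arguments) <= template.argument_names)


# Invented values, for illustration only.
earthquake_template = Template("earthquake", {"magnitude", "casualties"})
quake = Event("earthquake", nucleus="struck", location="Turkey",
              date="2016-01-01", arguments={"magnitude": 8.2})
print(quake.conforms_to(earthquake_template))
```

In this sketch the "Where" and "When" are kept as dedicated fields because they apply to every event type, while type-specific attributes such as magnitude live in the `arguments` dictionary checked against the template.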

In automatic information extraction, the information about "Who", "Where" and "When" is extracted by a traditional and fairly generic named entity recognition approach. The "What", on the other hand, is very domain-specific. For this reason, traditional IE systems rely on templates predefined by experts and identify events in texts with either rule-based systems or statistical models. However, in the general domain, where the huge number of possible events makes the manual definition of these templates impossible, information retrieval ("bag of words") methods take over, but they do not provide a structured answer.

In this project, we aim to tackle the following challenges:

  • Automatically discover event templates from very large text corpora, and populate a knowledge base dedicated to events. This implies a mixture of supervised and unsupervised approaches, which is necessary as soon as one considers such a generic problem.
  • Use this knowledge base to build an event aggregator and a semantic search engine. With this engine, a user (either journalist or end-user) will be able to query for an event type (e.g. earthquake) and provide filters on attribute values (location = Turkey, magnitude > 8, etc.). The knowledge base will also be published following the linked data principles for others to reuse.
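The kind of query described in the second challenge (an event type plus filters on attribute values) could look like the following minimal sketch. The function `search_events`, the `Filter` type, the dictionary layout of an event and the sample records are all hypothetical and only serve to illustrate the idea; they do not describe the project's actual search engine.

```python
import operator
from typing import Any, Callable, Dict, Iterable, List, Tuple

# A filter is (attribute name, comparison function, reference value).
Filter = Tuple[str, Callable[[Any, Any], bool], Any]


def search_events(events: Iterable[Dict[str, Any]],
                  event_type: str,
                  filters: List[Filter]) -> List[Dict[str, Any]]:
    """Return events of the given type whose attributes satisfy every filter."""
    results = []
    for event in events:
        if event.get("type") != event_type:
            continue
        # An event that lacks a filtered attribute is excluded.
        if all(name in event and compare(event[name], value)
               for name, compare, value in filters):
            results.append(event)
    return results


# Invented records, for illustration only.
knowledge_base = [
    {"type": "earthquake", "location": "Turkey", "magnitude": 8.3},
    {"type": "earthquake", "location": "Turkey", "magnitude": 6.0},
    {"type": "earthquake", "location": "Chile", "magnitude": 8.5},
    {"type": "election", "location": "Turkey"},
]

# Query: event type = earthquake, location = Turkey, magnitude > 8.
hits = search_events(knowledge_base, "earthquake",
                     [("location", operator.eq, "Turkey"),
                      ("magnitude", operator.gt, 8)])
print(hits)
```

A knowledge base published as linked data would more plausibly be queried through a dedicated query language than through in-memory filtering; the sketch only shows the shape of the request a user might express.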


Project Coordinator (team ILES). LIMSI's research fields cover a wide disciplinary spectrum from thermodynamics to cognition, encompassing fluid mechanics, energetics, acoustics and voice synthesis, spoken language and text processing, vision, and virtual reality.

ILES stands for Information, Written and Signed Language. ILES specifically addresses the analysis, understanding and production of written language, and the modelling and production of signed language.

Members participating in the project will bring their expertise in Information Extraction, Information Retrieval, Text Mining, temporal analysis and event analysis. This expertise has been validated by a number of publications in the field and by several projects (including ANR ChronoLines). Furthermore, evaluation methodology, another of the team's fields of expertise, is an integral part of this project.

Main contact: Xavier Tannier

The CEA LIST institute carries out research on digital systems. Its R&D programs, all focused on areas with major economic and social implications, deal with advanced manufacturing, embedded systems, ambient intelligence and ionizing radiation control for health applications.

Our knowledge engineering department develops tools for the automated analysis and description of unstructured data, in order to extract knowledge and deliver it to the user in the form of a usable synthesis. We work on texts for translation, summarization or social network monitoring purposes. We also work on image search and indexing, on the semantic processing of documents and on media streams.

The LVIC team develops an open-source, modular language processing platform, LIMA, which includes named entity recognition, parsing and some semantic and discourse analysis (in both French and English). It also has expertise in Information Extraction, particularly through a system developed in the field of seismic events and through its participation in the Slot Filling task at the TAC-KBP challenge.

Main contact: Olivier Ferret

Agence France-Presse (AFP) participates in research projects through its R&D unit, the Medialab, composed of engineers and tech-savvy journalists. Its role in the project is to provide the other partners with multimedia content and to express its needs regarding both event extraction and the expected search engine. The AFP Medialab team will be involved in evaluating the various technologies developed during the project and in disseminating the results to the media industry at seminars, trade shows and conferences.

The ASRAEL project will allow AFP to better structure its information content and to perform more focused information retrieval over this content. Visualization of data and events should become more relevant, particularly in the AFP4W engine, originally developed during the European project GLOCAL and the ANR project ChronoLines.

Main contact: Denis Teyssou, @dteyssou

EURECOM is a graduate school and research centre in communication systems located in the Sophia Antipolis technology park (French Riviera), a major European hub for telecommunications activities. EURECOM research teams are made up of international experts, recruited at the highest level, whose work is regularly honored and has earned international recognition.

The Multimedia Semantics and Interaction group aims at providing semantic models for multimedia metadata and user social activity on the web in order to support users' complex information needs and interactions, such as exploring large information spaces, gathering heterogeneous and distributed information, or personalizing system behaviour. We make extensive use of Linked Data technologies to perform these tasks.

EURECOM provides expertise in named entity extraction, through the NERD framework, as well as in event modeling in the cultural, tourism and media domains (projects EventMedia, 3cixty, collaborations with AFP and IPTC). EURECOM will play a key role in the design and development of the semantic search engine, reusing certain software components developed within the HyperTED project.

Main contact: Raphaël Troncy


Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret and Romaric Besançon.
A Dataset for Open Event Extraction in English.
In Proceedings of the 10th Language Resources and Evaluation Conference, 23-28 May 2016, Portorož (Slovenia).

Swen Ribeiro.
Extraction non supervisée de schémas d'événements dans les textes.
Présentation du projet à l'atelier "Journalisme Computationnel", Rennes, 15 mars 2016.

Dorian Kodelja, Romaric Besançon and Olivier Ferret.
Représentations et modèles en extraction d'événements supervisée.
Rencontres des Jeunes Chercheurs en Intelligence Artificielle (RJCIA 2017), Caen, France, 2017.

Julien Plu, Raphaël Troncy and Giuseppe Rizzo.
ADEL : une méthode adaptative de désambiguïsation d'entités nommées.
28th Journées Francophones d'Ingénierie des Connaissances (IC'17), pages 80-85, Caen, France, July 3-7, 2017.

Swen Ribeiro, Olivier Ferret and Xavier Tannier.
Unsupervised Event Clustering and Aggregation from Newswire and Web Articles.
In Proceedings of the 2nd workshop "Natural Language meets Journalism" (EMNLP 2017). Copenhagen, Denmark, September 2017.

Dorian Kodelja, Romaric Besançon, Olivier Ferret, Hervé Le Borgne and Emanuela Boros.
CEA LIST Participation to the TAC 2017 Event Nugget Track.
Text Analysis Conference (TAC), 2017.

Dorian Kodelja, Romaric Besançon and Olivier Ferret.
Intégration de contexte global par amorçage pour la détection d’événements.
25ème Conférence sur le Traitement Automatique des Langues Naturelles (CORIA-TALN-RJC 2018), Rennes, France, 2018.

Julien Plu, Roman Prokofyev, Alberto Tonon, Philippe Cudré-Mauroux, Djellel Eddine Difallah, Raphaël Troncy and Giuseppe Rizzo.
Sanaphor++: Combining Deep Neural Networks with Semantics for Coreference Resolution.
In 11th International Conference on Language Resources and Evaluation (LREC'18), Miyazaki, Japan, May 7-12, 2018.

Julien Plu, Kévin Cousot, Mathieu Lafourcade, Raphaël Troncy and Giuseppe Rizzo.
JeuxDeLiens: Word Embeddings and Path-Based Similarity for Entity Linking using the French JeuxDeMots Lexical Semantic Network.
In 25th French Conference on Natural Language Processing (TALN'18), Rennes, France, May 14-18, 2018.

Lorenzo Canale, Pasquale Lisena and Raphaël Troncy.
A Novel Ensemble Method for Named Entity Recognition and Disambiguation based on Neural Network.
In 17th International Semantic Web Conference (ISWC'18), Monterey, USA, October 8-12, 2018.

Julien Plu, Giuseppe Rizzo and Raphaël Troncy.
ADEL: ADaptable Entity Linking.
In Semantic Web Journal, 2018.