HAREM: NER for Portuguese

Linguateca



An updated version of the package with all final Second HAREM resources is available from http://www.linguateca.pt/HAREM/PacoteRecursosSegundoHAREM.zip (see Readme.txt), together with a relation glossary. The package includes:

For compatibility reasons, we also make available here the golden collection with NE annotation only, CDSegundoHAREMclassico.xml, as well as LÂMPADA 1.0, for anyone interested in replicating the Second HAREM exactly.

Publication of the Second HAREM book is ready.


What is HAREM?

HAREM is an evaluation contest for named entity recognition in Portuguese. Its first edition (First HAREM) started in September 2004, comprised two evaluation events, and officially ended at the First HAREM Workshop in Porto on 15 July 2006.

The current edition of HAREM (Second HAREM) is taking place now (see the schedule below).

Who organizes HAREM?

Linguateca organizes HAREM in the scope of its IRE model (Information, Resources, and Evaluation).

The Second HAREM is currently organized by the following members of the Linguateca team: Diana Santos (coord.), Cláudia Freitas, Hugo Oliveira, David Cruz, Paula Carvalho, Luís Miguel Cabral and Cristina Mota (since May 2008).

The First HAREM had Diana Santos and Nuno Cardoso as coordinators, and Nuno Seco, Rui Vilela, Paulo Rocha, Susana Afonso and Anabela Barreiro as further members of the organization.

Guidelines

So far, the guidelines are only available in Portuguese.

We strongly encourage participants to make heavy use of the input validator provided by the organization.

Full description of the syntax (in Portuguese) and the list of words in lower case accepted as part of a NE are also available (sintaxe and minusculas).

Evaluation measures

They are described and exemplified in Second HAREM: Evaluation.

Briefly, we use a generalization of the First HAREM's CSC to score semantic classification, together with the usual measures of precision, recall, overgeneration, undergeneration and F-measure.
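
As a rough illustration of the usual measures named above, the following sketch computes them from three tallies. It assumes the common MUC-style definitions (overgeneration as spurious NEs over all NEs in the system output, undergeneration as missing NEs over all NEs in the golden collection); the exact Second HAREM formulas, including the CSC generalization and the treatment of ALT, are the ones given in Second HAREM: Evaluation. The names NECounts, ratio and evaluate are hypothetical and are not part of the HAREM evaluation programs.

```python
from dataclasses import dataclass


@dataclass
class NECounts:
    """Hypothetical tallies from comparing a system run with the golden collection."""
    correct: int   # NEs in the system output that match the golden collection
    spurious: int  # NEs in the system output with no match in the golden collection
    missing: int   # golden-collection NEs absent from the system output


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(counts: NECounts) -> dict[str, float]:
    """Compute the measures under the assumed MUC-style definitions."""
    system_total = counts.correct + counts.spurious  # everything the system produced
    gold_total = counts.correct + counts.missing     # everything in the golden collection
    precision = ratio(counts.correct, system_total)
    recall = ratio(counts.correct, gold_total)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "overgeneration": ratio(counts.spurious, system_total),
        "undergeneration": ratio(counts.missing, gold_total),
    }
```

For example, evaluate(NECounts(correct=6, spurious=2, missing=4)) describes a run that produced 8 NEs against a golden collection containing 10.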

The two main differences from the First HAREM are that (i) partially correct NEs are no longer considered, but instead (ii) alternative delimitations are systematically encoded with the ALT tag, which systems are also encouraged to use in their output. In fact, in addition to vague classifications such as <EM CATEG="PESSOA|ORGANIZACAO">, we also expect systems to encode more than one alternative identification with the <ALT> syntax.
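
To make the vague classification notation concrete, here is a minimal sketch that reads the candidate categories of each EM element, assuming the annotated text is well-formed XML and that alternatives in CATEG are separated by "|" as in the example above. The function name vague_categories, the wrapping <doc> element and the sample sentence are invented for illustration; the sketch does not interpret <ALT> blocks, whose syntax is defined in the guidelines.

```python
import xml.etree.ElementTree as ET


def vague_categories(fragment: str) -> list[tuple[str, list[str]]]:
    """Return (entity text, candidate categories) for each EM element in the fragment."""
    # Wrap the fragment in a hypothetical root element so it parses as one XML document.
    root = ET.fromstring(f"<doc>{fragment}</doc>")
    entities = []
    for em in root.iter("EM"):
        # A vague classification lists several categories separated by "|".
        categories = [c for c in em.get("CATEG", "").split("|") if c]
        entities.append(("".join(em.itertext()), categories))
    return entities


# Invented sample sentence, not taken from any HAREM collection.
sample = (
    'A <EM CATEG="PESSOA|ORGANIZACAO">Gulbenkian</EM> '
    'abriu em <EM CATEG="LOCAL">Lisboa</EM>.'
)
for text, categories in vague_categories(sample):
    print(text, categories)
```

An NE with a single category simply yields a one-element list, so downstream scoring code can treat vague and non-vague classifications uniformly.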

See some detailed examples of evaluation here:

Finally, we have also provided separate evaluation measures for

HAREM resources

The Second HAREM collection is already available, as well as information about source, language variety, origin and text genre:

The only small differences relative to the one provided to the participants are:

Also, the golden collection has been made available:

Training resources for the Second HAREM

We have developed some examples of the full syntax of the Second HAREM:

Currently, you can access the (preliminary) resources compiled in the First HAREM and converted to the Second HAREM guidelines (basic, no SUBTIPO, no TEMPO yet):

Please note, however, that not every problem in these golden collections has been solved, so whenever the training material disagrees with the guidelines, the guidelines take precedence.

Finally, the TEMPO group also provided us with the first 10% of the previous golden collection from MiniHAREM:

Evaluation programs

Due to the change of measures and of HAREM syntax from the First to the Second HAREM, the programs had to be modified and, in some cases, written from scratch.

Results

Currently, you can inspect the results for

Schedule

Until 10th November 2007
Registration as a prospective participant open: 22 groups registered for the Second HAREM in response to the call for interest.
Until 30th November 2007
General discussion about what the Second HAREM should look like.
December 2007
Preliminary guidelines and example collections made available.
January 2008
Final guidelines available, together with the evaluation architecture, and (training) evaluation resources conforming to those guidelines.
14 - 28 April 2008
Evaluation contest took place (submissions only open for 48h after download of the collection): See the final participating teams (10 systems) out of the 16 systems enrolled for the Second HAREM.
8 May 2008
Second HAREM collection and its metadata were made available.
16 May 2008
First version of TEMPO golden collection available for inspection.
4 June 2008
First version of TEMPO golden collection available for inspection. Final version of the Second HAREM golden collection (classical mode) available.
6 June 2008
First version of ReRelEM golden collection available for inspection.
12 June 2008
Final version of TEMPO golden collection.
19 June 2008
HAREM results (main track, classical mode) available.
25 June 2008
TEMPO results made available.
31 July 2008
Final version of ReRelEM golden collection available.
6 August 2008
ReRelEM results made available.
8 August 2008
Final individual reports (except for ReRelEM) made available.
21 August 2008
New results of ReRelEM made available.
7 September 2008
HAREM workshop, as a satellite event of PROPOR 2008.
12 October 2008
Papers for the book on the Second HAREM due.
17 November 2008
Final resources packaged.
25 July 2009
The book on the Second HAREM was made publicly available.
7 April 2010
New ReRelEM golden collection, covering the complete HAREM golden collection, available.
27 April 2010
LÂMPADA 2.0 delivered.

More information about HAREM

We have published the following book on the Second HAREM:

Funding

Programa Operacional Sociedade do Conhecimento
Fundo Europeu de Desenvolvimento Regional (FEDER)
UMIC - Agência para a Sociedade do Conhecimento
FCCN - Fundação para a Computação Científica Nacional
FCT - Fundação para a Ciência e a Tecnologia

Last update: 6 May 2010.