CETEMPúblico: a large corpus of Portuguese newspaper language

More detailed information in Portuguese

Linguateca, the Computational processing of Portuguese follow up
CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) is a corpus of some 180 million words of European Portuguese. It was created by the Computational processing of Portuguese project after the Portuguese Ministry for Science and Technology (MCT) and the Portuguese daily newspaper PÚBLICO signed an agreement in April 2000.

Its first version, CETEMPúblico 1.0, came into existence on 25 July 2000. See the associated Readme file.

We make the corpus available in the following different ways:


FAQ - Frequently Asked Questions

Who are the envisaged users of CETEMPúblico?

This corpus is aimed mainly at those who develop computer programs that process the Portuguese language and who need raw material for their work. The text versions on CD were conceived for this kind of user.

We also want the corpus to be useful to everyone who studies the Portuguese language and wishes to check hypotheses against previously organized text material. The online and CQP versions are meant for such users, who are nonetheless welcome to get the corpus on CD in order to process it locally, possibly with the corpus processing system of their choice.

What is PÚBLICO?

PÚBLICO is a widely read daily Portuguese newspaper. It was founded in 1991 and was the first newspaper in Portugal to make available an online edition on the Web, Publico.pt.

Are there any restrictions to the use of CETEMPúblico?

As stated in the User Conditions file, CETEMPúblico can be used for research and technological development. Only its direct commercial exploitation is prohibited.

What are my duties as a user of CETEMPúblico?

The PÚBLICO newspaper should always be acknowledged as the source of the material in any presentation of work that makes use of CETEMPúblico, such as articles, theses and talks.

A free copy of any commercial products emerging from R&D projects using CETEMPúblico should be given to the PÚBLICO newspaper.

Am I allowed to reconstruct the full newspaper articles?

No. The agreement signed between MCT and PÚBLICO required us to split the articles into extracts and shuffle them so that no reconstruction would be possible. The corpus is not meant to replace the newspaper's archives.

Does CETEMPúblico include all the text published by PÚBLICO?

No. On the one hand, several editions were missing from the material provided by the newspaper, and we excluded newspaper sections not considered relevant to the goals of the corpus, such as quotations from other Portuguese newspapers ("Diz-se"), the errata section ("O PÚBLICO errou"), and sports results in table format (classifications, rankings, results, etc.). On the other hand, CETEMPúblico includes a large number of articles that were never actually published, for lack of space or opportunity.

Is the language of CETEMPúblico exclusively European Portuguese?

The vast majority is Portuguese from Portugal, although there are a few texts by Brazilian and African writers.

What is included in CETEMPúblico?

The corpus includes the text of around 2,600 editions of PÚBLICO, written (stored) between 1991 and 1998, amounting to approximately 180 million words.

CETEMPúblico 1.7 contains 1,504,258 extracts (CETEMPúblico 1.0 had 1,567,625), each bearing information about its section of origin and semester. Each extract is divided into paragraphs and sentences, and titles and authors are marked as such. See some examples of extracts.

How were the words counted?

Tokens containing at least one letter or digit were considered words. Punctuation marks were not considered words.
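The counting rule above can be illustrated with a minimal Python sketch. It assumes the text has already been split into tokens (the tokenizer itself is not shown), and it counts types case-sensitively, which is an assumption rather than something stated here. The function names is_word and count_words are hypothetical. The sketch applies only the rule as stated and does not attempt to reproduce the figures reported for the corpus.

```python
from typing import Iterable, Tuple


def is_word(token: str) -> bool:
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def count_words(tokens: Iterable[str]) -> Tuple[int, int]:
    """Return (number of word tokens, number of distinct word types).

    Types are compared case-sensitively here; this is an assumption
    made for the illustration.
    """
    words = [t for t in tokens if is_word(t)]
    return len(words), len(set(words))


if __name__ == "__main__":
    sample = ["O", "PÚBLICO", ",", "o", "jornal", ".", "o"]
    print(count_words(sample))
```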

Some approximate numbers (computed 2001):

              Tokens         Types
Units         229,038,019    1,033,041
Words         191,687,833      999,059
Punctuation    13,065,151       33,982

"Punctuation" includes tokens with punctuation marks, such as (1993), a) or 17:53.

Structure              Number
Extracts <ext>         1,504,258
Paragraphs <p>         2,571,735
Sentences <s>          7,082,094
Titles <t>               655,059
Authors <a>              247,392
List elements <li>        80,060

The list of tokens in CETEMPúblico is available from the AC/DC project pages (word list, lemma list).

Further quantitative information is available from the quantitative description page, which is updated for each AC/DC corpus whenever the programs change.

What is the corpus structure?

We specify the (non-annotated) corpus structure with the help of a small BNF grammar. Terminals appear in bold:

corpus = <corpus> extract+ </corpus>
extract = extract_id extract_contents </ext>
extract_contents = paragraph+
paragraph = title | author_id | <p> sentence+ </p> | list_element
title = <t> token+ </t>
author_id = <a> token+ </a>
list_element = <li> token+ </li>
sentence = ( <s> | <s tipo=frag> ) token+ </s>
token = | palavra | sinal_pontuação | identificador
X = ( *+ ) | *+
extract_id = <ext n=number sec=sec_id sem=semester >
number = [0-9]+
sec_id = soc | pol | clt | des | opi | eco | com | clt-soc | pol-soc | nd
semester = 91a | 91b | 92a | 92b | 93a | 93b | 94a | 94b | 95a | 95b | 96a | 96b | 97a | 97b | 98a | 98b

Notes:

Alternatively, we provide a DTD for SGML parsers.
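As an illustration of the extract_id rule in the grammar, here is a minimal Python sketch that reads an extract header and checks it against the section and semester codes listed above. It assumes attribute values are unquoted, as the grammar writes them, and that a semester code such as 93b refers to the year 1993. The names ExtractHeader and parse_ext_header, and the sample header line, are hypothetical and not part of the corpus distribution.

```python
import re
from typing import NamedTuple, Optional

# Header shape taken from the grammar: <ext n=number sec=sec_id sem=semester >
# Assumption: attribute values are unquoted, as written in the grammar.
EXT_HEADER = re.compile(
    r"<ext\s+n=(?P<n>[0-9]+)\s+sec=(?P<sec>[a-z-]+)\s+sem=(?P<sem>9[1-8][ab])\s*>"
)

# Section codes listed under sec_id in the grammar.
SECTIONS = {"soc", "pol", "clt", "des", "opi", "eco", "com",
            "clt-soc", "pol-soc", "nd"}


class ExtractHeader(NamedTuple):
    number: int
    section: str
    year: int   # assumed: two-digit code plus 1900
    half: str   # "a" or "b", kept exactly as it appears in the code


def parse_ext_header(line: str) -> Optional[ExtractHeader]:
    """Parse an <ext ...> header; return None if it does not fit the grammar."""
    m = EXT_HEADER.match(line.strip())
    if m is None or m.group("sec") not in SECTIONS:
        return None
    sem = m.group("sem")
    return ExtractHeader(int(m.group("n")), m.group("sec"),
                         1900 + int(sem[:2]), sem[2])


if __name__ == "__main__":
    # Hypothetical header line, for illustration only.
    print(parse_ext_header("<ext n=12 sec=clt-soc sem=93b>"))
```

A regular expression is enough for this single header line; for whole files, an SGML parser driven by the DTD is likely the more robust choice.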

Do the characters strictly reflect newspaper usage?

In some cases we made normalization decisions (the original material was encoded in Macintosh characters, while we chose the ISO-8859-1 character encoding standard). Some of the changes performed are:

Is all material included in CETEMPúblico in a valid format?

This was not the case with previous versions, but we have checked that all material in version 1.7 is in a valid format.

Are there other known problems in CETEMPúblico?

See also our ACL'2001 paper (see below) for precision and recall on structural markup concerning titles, author identification and sentence separation.

What is "CETEMPúblico's first million" (primeiro milhão do CETEMPúblico)?

As the name indicates, it is the first subset of CETEMPúblico (the first million words). It was created under our treebank project, Floresta Sintá(c)tica, and its sentence separation was manually revised and redone for text containing semicolons, colons and parentheses. It is not limited to the earliest text (1991); rather, it should feature a balanced selection of the years 1991 through 1999 as well as all the categories included in the full corpus.

This first million, which is also annotated, can be accessed through our AC/DC project.

What is the annotated CETEMPúblico (CETEMPúblico anotado)?

As with all other corpora of the AC/DC project, we have annotated CETEMPúblico with the PALAVRAS parser developed by Eckhard Bick. Because of the corpus size, the annotation was done on the premises of Eckhard Bick's VISL project and not at Linguateca.

Currently, users can query the annotation done in 2006 through the AC/DC project interface. Note that, because of efficiency problems, you are strongly advised to use a cut clause in your concordance queries, as in [word="como" & pos="V.*"] cut 20.

The annotated CETEMPúblico of 2006 is also available for download. To get the access information, please register on the Portuguese page.
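For users who build their queries in a script before pasting them into the interface, the advice about the cut clause can be sketched as a small Python helper. The function name with_cut is hypothetical; the helper only builds the query string and does not contact the AC/DC service.

```python
def with_cut(query: str, limit: int = 20) -> str:
    """Append a cut clause to a concordance query string.

    The limit bounds the number of results requested, as in the
    example query given in the text.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"{query.strip()} cut {limit}"


if __name__ == "__main__":
    print(with_cut('[word="como" & pos="V.*"]', 20))
```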

Is there more information about CETEMPúblico?

You can read more about this corpus in two articles, available here in electronic form:

How can I remain updated about future CETEMPúblico changes?

Whenever we learn about new problems with the corpus, we try to create patches to solve them. They will be available from CETEMPúblico's page. We will also update the corpus version to which we give access on the Web. So far (for users of version 1.0), we have made available six patches in Perl, named patch_cetempublico_1.0.x.pl, which can be downloaded from the information page.

To stay informed about the progress of the corpus, you can also subscribe to the CETEMPúblico mailing list by sending us a message. Note: you do not need to subscribe explicitly if you ordered the corpus through us, because registering to get a copy automatically adds you to this list.


Acknowledgements


Last update: 10 September 2007.