Listing Corpora

| Name | Notes | Citation | URL |
| --- | --- | --- | --- |
| Europarl | The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. Version 3 includes data through October 2006. | Philipp Koehn, Europarl: A Parallel Corpus for Statistical Machine Translation, MT Summit 2005 | http://www.statmt.org/europarl/ |
| Hunglish | The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 54.2 million words in 2.07 million sentences. It was created as part of the Hunglish project, jointly by the Media Research and Education Center at the Budapest University of Technology and Economics and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics. | | http://mokk.bme.hu/resources/hunglishcorpus/index_html |
| JRC-Acquis | The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. The Acquis Communautaire texts exist in 22 languages. | Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Toma | http://langtech.jrc.it/JRC-Acquis.html |
| News Commentary | From the WMT shared task at ACL 2007. 35-40 million words per language, collected from online news commentary. Languages: EN, DE, FR, CZ, ES. | | http://www.statmt.org/wmt07/shared-task.html |