| Name | Notes | Citation | URL |
|---|---|---|---|
| Europarl | The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. Version 3 includes data through October 2006. | Philipp Koehn: "Europarl: A Parallel Corpus for Statistical Machine Translation", MT Summit 2005. | http://www.statmt.org/europarl/ |
| Hunglish | The Hunglish Corpus is a free, sentence-aligned Hungarian-English parallel corpus of about 54.2 million words in 2.07 million sentences. It was created as part of the Hunglish project, a joint effort of the Media Research and Education Center at the Budapest University of Technology and Economics and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics. | | http://mokk.bme.hu/resources/hunglishcorpus/index_html |
| JRC-Acquis | The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. The Acquis Communautaire texts exist in 22 languages. | Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga: "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages", LREC 2006. | http://langtech.jrc.it/JRC-Acquis.html |
| News Commentary | From the WMT ACL 2007 Shared Task: 35-40 million words per language, collected from online news commentaries. Languages: EN, DE, FR, CZ, ES. | | http://www.statmt.org/wmt07/shared-task.html |
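
Several of the corpora above are described as sentence-aligned. As a rough illustration of what that means in practice, here is a minimal Python sketch that reads sentence pairs from two plain-text files. It assumes a layout of one sentence per line, with line N of one file corresponding to line N of the other; the individual corpora may ship in other formats, so check each distribution before relying on this. The function name `read_aligned`, the file names, and the sample sentences are all made up for the example.

```python
from itertools import zip_longest
from pathlib import Path
import tempfile


def read_aligned(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes one sentence per line and the same number of lines in both files.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a length mismatch instead of silently truncating.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:  # skip pairs where either side is empty
                yield s, t


def demo():
    """Write two tiny hypothetical files and read them back as pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        en = Path(tmp) / "sample.en"
        hu = Path(tmp) / "sample.hu"
        en.write_text("Good morning.\n\nThank you.\n", encoding="utf-8")
        hu.write_text("Jó reggelt.\nX\nKöszönöm.\n", encoding="utf-8")
        return list(read_aligned(en, hu))


if __name__ == "__main__":
    for source, target in demo():
        print(source, "|||", target)
```

Raising an error on a length mismatch is a deliberate choice in this sketch: once two line-aligned files fall out of step, every later pair is misaligned, so failing early is usually safer than truncating.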