Monolingual corpora in five languages and two domains
25 Apr 2011
First version of the domain-specific monolingual corpora delivered
Following the development of the initial WP4 prototype, the first version of the domain-specific monolingual corpora for English, Spanish, Italian, French and Greek was delivered as D4.3. As agreed, the collection consists of raw text documents in the Environment and Labour Legislation domains. We reached the target of more than 1M words for each language/domain combination by crawling HTML pages only. Normalization of the acquired resources included removal of boilerplate segments, paragraph segmentation and storage in an XCES-compatible encoding. Please refer to the complete text of D4.3 for more information.
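The normalization steps named above (boilerplate removal, paragraph segmentation, structured storage) can be illustrated with a minimal Python sketch. It is not the WP4 implementation: the function names, the word-count threshold used to flag boilerplate, and the simplified XML layout are all assumptions made for illustration, and the output is a stand-in rather than the actual XCES schema used in D4.3.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical threshold: segments with fewer words are treated as boilerplate.
MIN_WORDS = 5


def split_paragraphs(text):
    """Split raw text on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def is_boilerplate(paragraph, min_words=MIN_WORDS):
    """Crude heuristic (an assumption): very short segments such as menus or footers."""
    return len(paragraph.split()) < min_words


def normalize(text):
    """Return the paragraphs that survive the boilerplate filter."""
    return [p for p in split_paragraphs(text) if not is_boilerplate(p)]


def word_count(paragraphs):
    """Count whitespace-separated tokens, e.g. to track progress towards a size target."""
    return sum(len(p.split()) for p in paragraphs)


def to_xml(paragraphs, lang, domain):
    """Wrap paragraphs in a simplified XML layout (illustrative, not the real XCES schema)."""
    root = ET.Element("doc", {"lang": lang, "domain": domain})
    body = ET.SubElement(root, "body")
    for i, paragraph in enumerate(paragraphs, start=1):
        element = ET.SubElement(body, "p", {"id": f"p{i}"})
        element.text = paragraph
    return ET.tostring(root, encoding="unicode")
```

Calling `normalize` on the text extracted from a crawled page, then `word_count` on the result, gives a rough per-document contribution to the 1M-word target; `to_xml` shows where language and domain labels could be attached to each stored document.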
An internal deliverable was also produced including domain-specific comparable bilingual corpora in EN-FR and EN-EL.
As an internal deliverable, WP4 provided domain-specific comparable bilingual corpora for English-French and English-Greek. As agreed, the collection consists of pairs of raw text documents in the “Environment” and “Labour Legislation” domains. WP5 used the data to extract the development and test sets of parallel sentences it needed for these two domains.