First evaluation cycle reports are positive

11 Apr 2011
The first evaluation cycle consisted of the validation of the first version of the platform, an intrinsic evaluation of the monolingual crawling component, an extrinsic evaluation of the bilingual crawling component, and the establishment of the MT baselines.
The platform validation results are good. Some minor problems, as well as some issues that will become important in subsequent versions, have been identified so that they can be addressed in the second version.

The intrinsic evaluation of the monolingual crawlers indicated that the crawling component performs well in collecting relevant domain- or topic-specific documents. It also showed that using a BF algorithm for focused crawling leads to improvements over the baseline obtained with the default BRF algorithm integrated in the platform's crawling component. The total error rate in (domain) classification was 7.08%, meaning that only about 7% of the crawled data was not relevant to the chosen domain/topic. The evaluation also revealed some issues related to the cleaning and normalization of the crawled pages.
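
As a rough illustration of how such an error rate can be read, the sketch below computes the share of crawled documents judged not relevant to the target domain. The function name, the boolean judgment format and the sample data are all hypothetical and are not taken from the evaluation report.

```python
def irrelevant_share(relevance_judgments):
    """Return the fraction of crawled documents judged NOT relevant.

    relevance_judgments: list of booleans, one per crawled document,
    True if the document was judged relevant to the target domain.
    (Hypothetical input format, chosen only for this illustration.)
    """
    if not relevance_judgments:
        raise ValueError("need at least one judged document")
    irrelevant = sum(1 for is_relevant in relevance_judgments if not is_relevant)
    return irrelevant / len(relevance_judgments)


# Made-up sample: 10 judged documents, 1 of them off-topic.
sample = [True] * 9 + [False]
print(f"error rate: {irrelevant_share(sample):.2%}")
```

Under this reading, an error rate of 7.08% would correspond to roughly 7 off-topic documents in every 100 crawled.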

For the first cycle, the extrinsic evaluation in MT focused on: a) in-domain parallel development data and b) in-domain monolingual training data. The baseline system used for comparison is the MaTrEx system trained on Europarl data (i.e. general-domain data). Overall, the results show that the use of PANACEA-produced domain data (i.e. the crawled corpora) is important for parameter optimization, leading to a 32% improvement in terms of BLEU.
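
An improvement in BLEU can be reported either as an absolute difference in score or as a gain relative to the baseline, and this summary does not state which one the 32% refers to. The sketch below shows both calculations. The function names and the scores are hypothetical, chosen only to make the arithmetic visible, and are not results from the evaluation.

```python
def absolute_gain(baseline_bleu, system_bleu):
    """Difference in BLEU points between the system and the baseline."""
    return system_bleu - baseline_bleu


def relative_gain(baseline_bleu, system_bleu):
    """Gain expressed as a fraction of the baseline score."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (system_bleu - baseline_bleu) / baseline_bleu


# Made-up scores for illustration only.
baseline, tuned = 20.0, 25.0
print(f"absolute: {absolute_gain(baseline, tuned):.1f} BLEU points")
print(f"relative: {relative_gain(baseline, tuned):.0%}")
```

With these made-up scores, the same change reads as 5 BLEU points in absolute terms and 25% in relative terms, which is why it helps to know which convention a reported figure follows.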