First evaluation cycle reports are positive
The intrinsic evaluation of the monolingual crawlers indicated that the crawling component performs well in collecting relevant domain- or topic-specific documents. It also showed that using a BF algorithm for focused crawling improves on the baseline obtained with the default BRF algorithm of the crawling component integrated in the platform. The total error rate in (domain) classification was 7.08%, meaning that only about 7% of the crawled data was not relevant to the chosen domain/topic. The evaluation also revealed some issues related to the cleaning and normalization of the crawled pages.
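As a rough illustration of how a classification error rate of this kind can be computed, the sketch below takes a list of manual relevance judgments and returns the share of documents judged off-domain. The function name, the boolean encoding of judgments, and the sample counts are all assumptions made for the example; they are not the project's evaluation data or tooling.

```python
def classification_error_rate(judgments):
    """Return the fraction of crawled documents judged NOT relevant.

    `judgments` is a list of booleans (hypothetical encoding):
    True = document is relevant to the target domain, False = off-domain.
    """
    if not judgments:
        raise ValueError("need at least one judged document")
    irrelevant = sum(1 for is_relevant in judgments if not is_relevant)
    return irrelevant / len(judgments)


# Hypothetical sample: 200 manually judged documents, 14 of them off-domain.
sample = [True] * 186 + [False] * 14
rate = classification_error_rate(sample)
print(f"error rate: {rate:.2%}")
```

The same ratio can be reported per domain or per language by grouping the judgments before calling the function.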
For the first cycle, the extrinsic evaluation in MT focused on a) in-domain parallel development data and b) in-domain monolingual training data. The baseline system used for comparison is the MaTrEx system trained on Europarl data (i.e. general-domain data). Overall, the results show that the use of PANACEA-produced domain data (i.e. the crawled corpora) is important for parameter optimization, leading to a 32% improvement in terms of BLEU.