The initial WP4 prototype and its components
D4.2 reports on the development of the initial WP4 prototype, which consists of tools for monolingual and bilingual focused web crawling. The Focused Monolingual Crawler is a best-first crawler that includes a text-to-topic classifier, which decides whether a visited web page is relevant to a predefined domain. The Focused Bilingual Crawler goes one step further: it mirrors the structure of multilingual sites and tries to detect relevant documents that are translations of each other. Both crawlers include modules for language identification and boilerplate removal, and both have been integrated as web services in the PANACEA platform, following the solution path detailed in D4.1.
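To make the best-first idea concrete, the following is a minimal Python sketch, not the PANACEA implementation. It assumes a weighted keyword list as a stand-in for the text-to-topic classifier, assumes that each outgoing link inherits the relevance score of the page it was found on, and replaces real HTTP fetching with an in-memory toy site. All names (`TOPIC_TERMS`, `relevance`, `best_first_crawl`, `TOY_WEB`) and the threshold value are hypothetical.

```python
import heapq

# Hypothetical weighted topic terms; a real classifier would use a domain-specific definition.
TOPIC_TERMS = {"labour": 2.0, "employment": 2.0, "union": 1.0}


def relevance(text, terms=TOPIC_TERMS):
    """Sum the weights of topic terms in the text, normalised by token count."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(terms.get(tok, 0.0) for tok in tokens) / len(tokens)


def best_first_crawl(seeds, fetch, threshold=0.1, max_pages=100):
    """Crawl so that links found on higher-scoring pages are visited first.

    fetch(url) is assumed to return (text, outlinks), or None if unavailable.
    Returns (visit_order, relevant_urls).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order, relevant = [], []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, outlinks = page
        order.append(url)
        score = relevance(text)
        if score >= threshold:
            relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of its parent page.
                heapq.heappush(frontier, (-score, link))
    return order, relevant


# Toy in-memory "site" standing in for real HTTP requests.
TOY_WEB = {
    "a": ("employment law and labour union news", ["b", "c"]),
    "b": ("football results and weather", ["d"]),
    "c": ("labour market employment statistics", ["e"]),
    "d": ("more football", []),
    "e": ("union employment agreement", []),
}

order, relevant = best_first_crawl(["a"], TOY_WEB.get)
print(order)
print(relevant)
```

In this toy run, page `e` is fetched before page `d` because it was linked from an on-topic page, whereas `d` was linked from an off-topic one. A production crawler would additionally need politeness delays, robots.txt handling, language identification, and boilerplate removal before classification.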
Please refer to the PDF version of D4.2 for more detailed information.