
WP8 defines validation and evaluation scenario

28 Oct 2010

WP8 defines a specific use case for validation and evaluation: the adaptation of an MT system to a specific, specialized domain.

Included is a description of the real-life scenario that will be implemented in order to validate and evaluate the web service system. The evaluation aims to demonstrate a significant reduction in the cost and time of producing language resources (LRs). The details, requirements and implications of the scenario are elaborated below.

Validation and Evaluation scenario

In the standard real-life use case, users of PANACEA will already have resources available and will want to update them or merge them with new material. So, while the first set of services is relevant for creating resources, services to compare and merge resources will become important in later development phases.
PANACEA WP8 defines a specific use case for evaluation, which is the adaptation of an MT system to a specific, specialized domain.
This is a very complex use case; however, it does not cover all PANACEA tools, nor all PANACEA languages. Nevertheless, it has practical relevance, as the production of MT systems is one of the major industrial applications of language technology.

Details and requirements for this use case are given below (Chapter 3). It will involve:

  • Web crawling, in search of a corpus of parallel documents for a particular special domain
  • Normalization of these documents: removal of boilerplate, normalization of character codes, hyphenation, etc.
  • Sentence segmentation: breakdown of texts into sentences
  • Sentential alignment; handling of non-alignable segments
  • Tokenization, for both rule-based (RMT) and statistical (SMT) MT usage

With this toolset, the input for one type of MT system can be generated.
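As a rough illustration of how the text-preparation steps above fit together, here is a minimal Python sketch. It is not PANACEA code: all function names are hypothetical, the sentence splitter and tokenizer are deliberately naive regular expressions, and the "alignment" step is only a positional pairing with a length-ratio filter that stands in for a real sentence aligner. Web crawling and boilerplate removal are left out.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize(text: str) -> str:
    """Normalize character codes, undo line-end hyphenation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def segment_sentences(text: str) -> List[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def align(src: List[str], tgt: List[str], max_ratio: float = 2.0
          ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Placeholder aligner (assumption): pair sentences by position and set
    aside pairs whose character-length ratio exceeds max_ratio."""
    aligned, rejected = [], []
    for s, t in zip(src, tgt):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        (aligned if ratio <= max_ratio else rejected).append((s, t))
    return aligned, rejected


def prepare(text: str) -> List[List[str]]:
    """Normalize, segment and tokenize one document."""
    return [tokenize(s) for s in segment_sentences(normalize(text))]
```

Calling `prepare` on each side of a document pair and passing the segmented sentences to `align` gives tokenized sentence pairs plus a separate list of pairs judged non-alignable, which is the kind of input the paragraph above has in mind.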

For RMT systems, additional tools are required to produce the glossaries describing the domain-specific terminology. These tools are:

  • Monolingual term extraction, identifying the source terms (both single and multiword)
  • Monolingual term annotation, producing the entry annotations required by the MT system, first for the source-side and later for the target-side entries
  • Bilingual term extraction, identifying translation candidates for a given source term
  • Bilingual term annotation, which defines transfer conditions for lexical selection in the case of 1:n translations
  • Glossary term input, to merge the domain-specific terminology with the already existing terms
  • Named entity recognition for proper names, which must be protected from being translated, or added to the dictionaries as proper names
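The glossary-merging step can be pictured with a small Python sketch. The data layout (a dictionary from source term to an ordered list of target candidates) and the policy of listing domain-specific translations before existing ones are assumptions made for illustration only. A real RMT glossary would also carry the entry annotations and transfer conditions mentioned above.

```python
from typing import Dict, List

# Hypothetical glossary layout: source term -> ordered translation candidates.
Glossary = Dict[str, List[str]]


def merge_glossaries(existing: Glossary, domain: Glossary) -> Glossary:
    """Merge domain-specific terms into an existing glossary.

    Assumed policy: domain translations come first, existing translations
    are kept as fallbacks, and duplicates are dropped. Inputs are not modified.
    """
    merged = {term: list(targets) for term, targets in existing.items()}
    for term, targets in domain.items():
        current = merged.get(term, [])
        # dict.fromkeys removes duplicates while preserving order.
        merged[term] = list(dict.fromkeys(list(targets) + current))
    return merged
```

For a 1:n case such as a general glossary mapping "bank" to "Bank" and a domain glossary mapping it to "Ufer", the merged entry lists "Ufer" first and keeps "Bank" as a fallback.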

In a factory-like setting, these tools should be chained into (possibly two) series of workflows, to be called ‘General-MT-adaptation’ and ‘RMT-adaptation’ respectively.

It should be noted that some of these tools, such as term extraction and named entity recognition, can themselves be workflows consisting of several elementary steps such as dictionary lookup and tagging.
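The idea that a workflow is a chain of steps, and that a step may itself be a workflow, can be sketched in a few lines of Python. This is only an illustration of the composition principle: the `workflow` helper and the toy steps are hypothetical and do not reflect how the PANACEA web services are actually orchestrated.

```python
from typing import Any, Callable, List

Step = Callable[[Any], Any]


def workflow(*steps: Step) -> Step:
    """Chain steps so that each step's output feeds the next one.

    The result is itself a Step, so workflows can be nested.
    """
    def run(data: Any) -> Any:
        for step in steps:
            data = step(data)
        return data
    return run


# Toy elementary steps (hypothetical stand-ins for tagging, lookup, etc.).
def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def drop_function_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in {"the", "of", "a"}]


# A tool that is itself a workflow of elementary steps ...
term_extraction = workflow(lowercase, drop_function_words)
# ... used as a single step inside a larger workflow.
count_terms = workflow(term_extraction, len)
```

Because `workflow` returns an ordinary callable, a larger chain does not need to know whether one of its steps is elementary or composite.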

*This information can also be found on the project blog, where you can also find links to upcoming conferences, progress reports, and the thoughts, opinions and commentaries of project members.*