The PRESEMT project constitutes a novel approach to Machine Translation, characterised by the use of (a) cross-disciplinary techniques, mainly borrowed from the machine learning and computational intelligence domains, and (b) relatively inexpensive language resources. The aim is to develop a language-independent methodology for the creation of a flexible and adaptable MT system, the features of which ensure easy portability to new language pairs or adaptability to particular user requirements. PRESEMT falls within the Corpus-based MT (CBMT) paradigm. The resources employed, a small bilingual corpus and a large target language (TL) monolingual one, are collected as far as possible over the web, to simplify the development of resources for new language pairs.

PRESEMT rests on three key aspects: (a) modelling based on syntactic phrases, which have been shown to improve translation quality; (b) pattern recognition approaches, such as extended clustering or neural networks, for developing a language-independent analysis; and (c) evolutionary algorithms for system optimisation.
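As an illustration of the last aspect, the following is a minimal sketch of an evolutionary search over a module's numeric parameters. It is not PRESEMT's actual optimiser: the function `evolve`, the toy fitness function and all parameter values are assumptions made for this example. In a real setting the fitness would be some translation-quality score.

```python
import random

def evolve(fitness, n_params, generations=200, pop_size=20, sigma=0.1, seed=0):
    """Minimal elitist evolutionary search that maximises `fitness`."""
    rng = random.Random(seed)
    # Start from random parameter vectors with components in [0, 1).
    population = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        # Mutation: each parent yields one child by adding Gaussian noise.
        children = [[g + rng.gauss(0.0, sigma) for g in parent] for parent in population]
        # Selection: keep the best pop_size individuals among parents and children.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]

# Hypothetical fitness: highest when the two parameters equal (0.3, 0.7).
def toy_fitness(params):
    target = (0.3, 0.7)
    return -sum((p - t) ** 2 for p, t in zip(params, target))

best = evolve(toy_fitness, n_params=2)
print(best, toy_fitness(best))
```

Because parents compete with their children in the selection step, the best fitness never decreases from one generation to the next.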

PRESEMT has a duration of 3 years. The work plan is organised into 9 work packages covering five aspects, namely project management (WP1), dissemination activities (WP8), system specifications (WP2), system development & integration (WP3 – WP7) and validation & evaluation (WP9).

The language pairs studied are given below:

  • Czech --> English & German
  • English --> German
  • German --> English
  • Greek --> English & German
  • Norwegian --> English & German

Near the end of the project an assessment phase is scheduled, where additional language pairs will be investigated, with Italian as the target language.

Architecture

The PRESEMT system comprises 3 stages, each of which has a modular structure:

1. Pre-processing stage: It involves the compilation of the resources that the MT system needs in order to operate, i.e. the collection and appropriate annotation of corpora, the elicitation of phrasing information as well as the extraction of semantic and statistical data.

2. Main translation engine: This component, being the core part of the system, translates a source language (SL) text to a target language (TL) one, drawing, in stepwise mode, on the information obtained in the Pre-processing stage.

3. Post-processing stage: This stage offers the user the opportunity to modify the system translation output according to their preferences. The system can then incorporate these modifications so as to adapt itself to the given input.
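The flow of data through the three stages can be sketched as follows. This is a toy illustration only: the names `Resources`, `preprocess`, `translate` and `postprocess`, the word-by-word lexicon and the correction table are all assumptions made for the example and do not reflect the actual PRESEMT modules.

```python
from dataclasses import dataclass, field

@dataclass
class Resources:
    """Output of the pre-processing stage (all fields are illustrative)."""
    lexicon: dict                                     # SL word -> TL word
    corrections: dict = field(default_factory=dict)   # user edits recorded so far

def preprocess(bilingual_pairs):
    """Stage 1: compile resources; here just a lexicon from aligned word pairs."""
    return Resources(lexicon=dict(bilingual_pairs))

def translate(sl_text, res):
    """Stage 2: toy word-by-word translation drawing on the stage-1 resources."""
    tl_text = " ".join(res.lexicon.get(word, word) for word in sl_text.split())
    # Reuse a correction if the user has previously edited this exact output.
    return res.corrections.get(tl_text, tl_text)

def postprocess(system_output, user_edit, res):
    """Stage 3: record the user's edit so that later translations can reuse it."""
    if user_edit != system_output:
        res.corrections[system_output] = user_edit
    return user_edit

res = preprocess([("guten", "good"), ("Morgen", "morning")])
first = translate("guten Morgen", res)      # "good morning"
postprocess(first, "Good morning", res)     # the user capitalises the output
second = translate("guten Morgen", res)     # "Good morning"
print(first, "|", second)
```

The point of the sketch is the feedback loop: the post-processing stage writes into the same resources that the translation engine reads, so a user's modification changes subsequent output.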

 

Figure: PRESEMT system architecture

Pre-processing stage (4 modules):

  • Corpus creation & annotation module
  • Phrase aligner module
  • Phrasing model generator
  • Corpus modelling module

Main translation engine (3 modules):

  • Structure selection module
  • Translation equivalent selection module
  • Optimisation module

Post-processing stage (2 modules):

  • Post-processing module
  • User adaptation module

Milestones

WP2

  • MS1: Definition of system specifications
  • MS2: Evaluation set-up
  • MS3: Selection of language pairs

 

WP3

  • MS4: Corpus creation & annotation module (ver.1)
  • MS5: Phrase aligner module (ver.1)
  • MS6: Corpus modelling module (ver.1)
  • MS7: Corpus creation & annotation module (ver.3)
  • MS8: Phrase aligner module (ver.2)
  • MS9: Corpus modelling module (ver.2)

 

WP4

  • MS10: Structure selection module (ver.2)
  • MS11: Optimisation module 1

 

WP5

  • MS12: Translation equivalent selection module (ver.2)
  • MS13: Optimisation module 2

 

WP6

 

  • MS14: Post-processing module
  • MS15: User adaptation module

 

WP7

  • MS16: PRESEMT Prototype (ver.1)
  • MS17: PRESEMT Prototype (ver.2)
  • MS18: PRESEMT Final Prototype

 

WP8

  • MS19: Planning dissemination activities

 

WP9

  • MS20: 1st Evaluation/Validation Round
  • MS21: 2nd Evaluation/Validation Round
  • MS22: Extension to other language pairs exercise

Deliverables

WP7

D7.1.1: PRESEMT Prototype (ver.1)

D7.1.2: PRESEMT Prototype (ver.2)

D7.2.1: PRESEMT System documentation (ver.1)

D7.2.2: PRESEMT System documentation (ver.2)

D7.2.3: PRESEMT System documentation (ver.3)

D7.3.1: User manual (ver.1)

D7.3.2: User manual (ver.2)

D7.3.3: User manual (ver.3)

D7.4: PRESEMT Final Prototype

WP9

D9.1: 1st Report on system validation & evaluation

D9.2: 2nd Report on system validation & evaluation

D9.2: 2nd Report on system validation & evaluation [Supplement]

D9.3: System assessment

D9.3: System assessment [Resubmission]

Work packages

WP1

Management

This WP covers both the administrative aspect of the PRESEMT project, i.e. coordinating activities, monitoring work progress, reporting to the community, managing financial aspects etc., and the technical one, namely monitoring technical issues, work quality, technical decisions to be made etc.


WP2

System specifications

This WP defines the guidelines on which the development of PRESEMT will be based, i.e. it sets the specifications of the system prototype and determines the modules that the prototype will comprise. Furthermore, the consortium will identify the data and test suites required for validating and evaluating the PRESEMT prototype.


WP3

Corpus extraction & processing algorithms

WP3 involves the development of three modules of the PRESEMT prototype: (a) the Corpus creation & annotation module, released in 3 different versions, which will be responsible for the collection of resources over the web and their appropriate annotation, (b) the Phrase aligner module, released in 2 different versions, which, by consulting a small parallel corpus, will automatically define phrasing models in a given language pair, and (c) the Corpus modelling module, released in 2 different versions, which will identify semantic relations between words.


WP4

Structure selection

WP4 involves the development of the Structure selection module, released in 2 different versions, which will handle the first phase of the translation process. Furthermore, WP4 involves the optimisation of the parameters of this module.


WP5

Translation equivalent selection

WP5 involves the development of the Translation equivalent selection module, released in 2 different versions, which will handle the second phase of the translation process. Furthermore, WP5 involves the optimisation of the parameters of this module.


WP6

Post-processing & User adaptation

WP6 involves the development of two modules, namely (a) the Post-processing module, via which the end user will be able to correct the system output, and (b) the User adaptation module, where the focus is to make the system ‘learn’ from the user’s modifications.


WP7

Integration

Within WP7 the various modules developed in the previous WPs will be integrated into one prototype, issued in 3 successive versions, while the performance of the prototype will be enhanced through parallelisation. Furthermore, each version of the system prototype will be accompanied by the corresponding system documentation and user manual.


WP8

Dissemination

This work package involves the development of a dissemination and exploitation strategy to be followed during the project lifecycle, together with the activities that implement this strategy.


WP9

Validation & Evaluation

WP9 covers all the experimental activities performed in order to (a) validate the system prototypes in terms of technical requirements and (b) evaluate their performance in terms of translation quality. The validation and evaluation experiments, both consortium-internal and consortium-external, are planned to take place twice during the project lifecycle, following the release of each of the two versions of the system prototype.

The language pairs to be studied and used for evaluation purposes are the following:

  • {Czech, Greek, German, Norwegian} --> English
  • {Czech, Greek, English, Norwegian} --> German

Besides these activities, the consortium also plans to assess the system extensibility and portability to new languages, by applying the 2nd system prototype to other language pairs (cf. the following list), different from those used for the system development. The outcome of this task will also contribute to the issuing of the PRESEMT final prototype.

  • {Czech, Greek, German, English, Norwegian} --> Italian
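The set notation used in the lists above can be expanded into concrete (source, target) pairs as in the short sketch below. The helper `expand` is a name introduced here for illustration; the language lists are copied from the text.

```python
from itertools import product

def expand(sources, targets):
    """Return all (source, target) pairs, skipping identical languages."""
    return [(s, t) for s, t in product(sources, targets) if s != t]

# Pairs used for system development and evaluation.
development = (
    expand(["Czech", "Greek", "German", "Norwegian"], ["English"])
    + expand(["Czech", "Greek", "English", "Norwegian"], ["German"])
)

# Pairs used to assess portability to a new target language.
assessment = expand(["Czech", "Greek", "German", "English", "Norwegian"], ["Italian"])

print(len(development), len(assessment))  # 8 5
```

This gives eight development pairs and five additional pairs for the assessment phase.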


Workplan

PRESEMT will have a 36-month duration. The proposed work plan for reaching the project objectives is organised into nine (9) work packages covering five aspects, namely project management (WP1), dissemination activities (WP8), system specifications (WP2), system development & integration (WP3-WP7) and validation & evaluation (WP9).

Within PRESEMT an iterative development approach will be followed, for both the individual system modules and the system as a whole. This approach entails the creation of intermediate system prototypes, which will incorporate the results of repeated validation and evaluation activities. It should make it possible to address effectively any critical issues that emerge during development and to adopt well-planned solutions.

Within the proposed timeframe, two broad development phases have been planned, each resulting in a system prototype (PRESEMT Prototype (ver.1) and PRESEMT Prototype (ver.2)). Both prototypes will be developed in accordance with the design principles and specifications defined in WP2.

The first system prototype, due in month 19, will include the first versions of the modules developed in WP3-WP6, and will subsequently be validated and evaluated in terms of performance and translation quality. The test results will be fed back into the module development process to support system improvement on the way to the second prototype.

The second system prototype, due in month 26, will include the final versions of the aforementioned modules; parallelisation of processing will also have been completed by then. The second validation and evaluation iteration will then take place to check the effectiveness of the improvements made.

The second testing iteration will be complemented by an assessment / experimentation phase, in which the system's handling of other language pairs will be investigated, leading to the final system prototype (PRESEMT Final Prototype) at the end of the project lifetime.