The PRESEMT project constitutes a novel approach to Machine Translation, characterised by the use of (a) cross-disciplinary techniques, mainly borrowed from the machine learning and computational intelligence domains, and (b) relatively inexpensive language resources. The aim is to develop a language-independent methodology for the creation of a flexible and adaptable MT system, the features of which ensure easy portability to new language pairs or adaptability to particular user requirements. PRESEMT falls within the Corpus-based MT (CBMT) paradigm. The resources employed, a small bilingual corpus and a large target language (TL) monolingual one, are collected as far as possible over the web, to simplify the development of resources for new language pairs.
The key aspects of PRESEMT are (a) modelling based on syntactic phrases, which have been shown to improve translation quality, (b) pattern recognition approaches (such as extended clustering or neural networks) towards the development of a language-independent analysis, and (c) evolutionary algorithms for system optimisation.
PRESEMT has a duration of 3 years. The work plan is divided into 9 work packages relating to five aspects, namely project management (WP1), dissemination activities (WP8), system specifications (WP2), system development & integration (WP3-WP7) and validation & evaluation (WP9).
The language pairs studied are given below:
- Czech --> English & German
- English --> German
- German --> English
- Greek --> English & German
- Norwegian --> English & German
Near the end of the project an assessment phase is scheduled, where additional language pairs will be investigated, with Italian as the target language.
Architecture
The PRESEMT system comprises 3 stages, each of which has a modular structure:
1. Pre-processing stage: It involves the compilation of resources needed for the MT system to perform, i.e. the collection and appropriate annotation of corpora, the elicitation of phrasing information as well as the extraction of semantic and statistical data.
2. Main translation engine: This component, being the core part of the system, translates a source language (SL) text to a target language (TL) one, drawing, in stepwise mode, on the information obtained in the Pre-processing stage.
3. Post-processing stage: This stage offers the user the opportunity to modify the system translation output according to their preferences. The system can then take these modifications into account so as to adapt itself to the user's input.
Pre-processing stage: 4 modules | Main Translation Engine: | Post-processing stage:
Corpus creation & annotation module | Structure selection module | Post-processing module
Phrase aligner module | Translation equivalent selection module |
Phrasing model generator | Optimisation module | User adaptation module
Corpus modelling module | |
PRESEMT system architecture
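As a rough illustration of the three-stage layout above, the following Python sketch records the stages and modules from the table and chains three placeholder stage functions. The function names, the `Resources` container and the toy word lexicon are hypothetical and only show how data might flow between the stages; they are not the PRESEMT implementation.

```python
from dataclasses import dataclass, field

# Stage -> modules, as listed in the architecture table.
ARCHITECTURE = {
    "Pre-processing stage": [
        "Corpus creation & annotation module",
        "Phrase aligner module",
        "Phrasing model generator",
        "Corpus modelling module",
    ],
    "Main translation engine": [
        "Structure selection module",
        "Translation equivalent selection module",
        "Optimisation module",
    ],
    "Post-processing stage": [
        "Post-processing module",
        "User adaptation module",
    ],
}


@dataclass
class Resources:
    """Hypothetical container for what the pre-processing stage produces."""
    lexicon: dict = field(default_factory=dict)           # SL word -> TL word
    user_corrections: dict = field(default_factory=dict)  # system output -> user edit


def preprocess(bilingual_pairs):
    """Stage 1 (placeholder): derive resources from a tiny word-aligned corpus."""
    return Resources(lexicon=dict(bilingual_pairs))


def translate(sl_text, resources):
    """Stage 2 (placeholder): a word-by-word lookup standing in for the two
    translation phases (structure selection, translation equivalent selection)."""
    tl_words = [resources.lexicon.get(word, word) for word in sl_text.split()]
    tl_text = " ".join(tl_words)
    # Previously stored user corrections override the raw output.
    return resources.user_corrections.get(tl_text, tl_text)


def postprocess(tl_text, user_edit, resources):
    """Stage 3 (placeholder): record a user's correction for later reuse."""
    if user_edit is not None and user_edit != tl_text:
        resources.user_corrections[tl_text] = user_edit
        return user_edit
    return tl_text


if __name__ == "__main__":
    res = preprocess([("guten", "good"), ("Morgen", "morning")])
    draft = translate("guten Morgen", res)
    final = postprocess(draft, "Good morning", res)
    print(draft, "->", final)
    print(translate("guten Morgen", res))  # now reflects the stored correction
```

The point of the sketch is the separation of concerns: resources are built once in the first stage, the second stage only reads them, and the third stage feeds user modifications back so that later translations can adapt.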
Milestones
- MS1: Definition of system specifications
- MS2: Evaluation set-up
- MS3: Selection of language pairs
- MS4: Corpus creation & annotation module (ver.1)
- MS5: Phrase aligner module (ver.1)
- MS6: Corpus modelling module (ver.1)
- MS7: Corpus creation & annotation module (ver.3)
- MS8: Phrase aligner module (ver.2)
- MS9: Corpus modelling module (ver.2)
- MS10: Structure selection module (ver.2)
- MS11: Optimisation module 1
- MS12: Translation equivalent selection module (ver.2)
- MS13: Optimisation module 2
- MS14: Post-processing module
- MS15: User adaptation module
- MS16: PRESEMT Prototype (ver.1)
- MS17: PRESEMT Prototype (ver.2)
- MS18: PRESEMT Final Prototype
- MS19: Planning dissemination activities
- MS20: 1st Evaluation/Validation Round
- MS21: 2nd Evaluation/Validation Round
- MS22: Extension to other language pairs exercise
Work packages
WP1
Management
This WP covers both the administrative side of the PRESEMT project (coordinating activities, monitoring work progress, reporting to the community, managing financial aspects etc.) and the technical side (monitoring technical issues, work quality, technical decisions to be made etc.).
WP2
System specifications
This WP defines the guidelines on which the development of PRESEMT will be based, i.e. it defines the specifications of the system prototype and decides which modules this prototype will comprise. Furthermore, the consortium will identify the data and test suites required for validating and evaluating the PRESEMT prototype.
WP3
Corpus extraction & processing algorithms
WP3 involves the development of three modules of the PRESEMT prototype: (a) the Corpus creation & annotation module, released in 3 different versions, which will be responsible for the collection of resources over the web and their appropriate annotation, (b) the Phrase aligner module, released in 2 different versions, which, by consulting a small parallel corpus, will automatically define phrasing models in a given language pair, and (c) the Corpus modelling module, released in 2 different versions, which will identify semantic relations between words.
WP4
Structure selection
WP4 involves the development of the Structure selection module, released in 2 different versions, which will handle the first phase of the translation process. Furthermore, WP4 involves the optimisation of this module's parameters.
WP5
Translation equivalent selection
WP5 involves the development of the Translation equivalent selection module, released in 2 different versions, which will handle the second phase of the translation process. Furthermore, WP5 involves the optimisation of this module's parameters.
WP6
Post-processing & User adaptation
WP6 involves the development of two modules, namely (a) the Post-processing module, via which the end user will be able to correct the system output, and (b) the User adaptation module, where the focus is to make the system ‘learn’ from the user’s modifications.
WP7
Integration
Within WP7 the various modules developed in the previous WPs will be integrated into one prototype, issued in 3 successive versions, while the performance of the prototype will be enhanced through parallelisation. Furthermore, each version of the system prototype will be accompanied by the respective system documentation and user manual.
WP8
Dissemination
This work package involves the development of a dissemination and exploitation strategy to be followed during the project lifecycle, together with the activities that implement this strategy.
WP9
Validation & Evaluation
WP9 comprises all the experimental activities to be performed with the purpose of (a) validating the system prototypes in terms of technical requirements and (b) evaluating their performance in terms of translation quality. The validation and evaluation experiments, both consortium-internal and consortium-external, are planned to take place twice during the project lifecycle, following the issuing of the two versions of the system prototype.
The language pairs to be studied and used for evaluation purposes are the following:
- {Czech, Greek, German, Norwegian} --> English
- {Czech, Greek, English, Norwegian} --> German
Besides these activities, the consortium also plans to assess the system extensibility and portability to new languages, by applying the 2nd system prototype to other language pairs (cf. the following list), different from those used for the system development. The outcome of this task will also contribute to the issuing of the PRESEMT final prototype.
- {Czech, Greek, German, English, Norwegian} --> Italian
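The set notation used in the lists above can be expanded into explicit language pairs. The short Python sketch below does this; the helper name `expand` and the variable names are illustrative assumptions, not part of the project's tooling.

```python
def expand(sources, target):
    """Expand '{A, B, ...} --> T' into explicit (source, target) pairs."""
    return [(source, target) for source in sources if source != target]


# Pairs used for development and evaluation.
development_pairs = (
    expand(["Czech", "Greek", "German", "Norwegian"], "English")
    + expand(["Czech", "Greek", "English", "Norwegian"], "German")
)

# Pairs used only in the extensibility and portability assessment.
extension_pairs = expand(
    ["Czech", "Greek", "German", "English", "Norwegian"], "Italian"
)

for source, target in development_pairs + extension_pairs:
    print(f"{source} --> {target}")
```

Expanded this way, the evaluation lists yield eight development pairs, matching the language pairs listed at the top of the page, plus five additional pairs with Italian as the target language.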
Workplan
PRESEMT will have a 36-month duration. The proposed work plan for reaching the project objectives is divided into nine (9) work packages relating to five aspects, namely project management (WP1), dissemination activities (WP8), system specifications (WP2), system development & integration (WP3-WP7) and validation & evaluation (WP9).
Within PRESEMT an iterative development approach will be followed, concerning both the individual system modules and the system as a whole. This approach entails the creation of intermediate system prototypes, which will incorporate the results of repeated validation and evaluation activities. In this way, critical issues that emerge during development can be addressed effectively and with well-planned solutions.
Within the timeframe proposed, broadly two development phases have been planned, each of them resulting in a system prototype (PRESEMT Prototype (ver.1) & PRESEMT Prototype (ver.2)). Both prototypes will be developed in accordance with the design principles and specifications defined in WP2.
The first system prototype, due in month 19, will include the first versions of the modules developed in WP3-WP6, and will subsequently be validated & evaluated in terms of performance and translation quality. The testing results will be fed back into the module development process to support the improvement of the system as it proceeds towards the second prototype.
The second system prototype, due in month 26, will include the final versions of the aforementioned modules, and by then the parallelisation of processing will also have been completed. The second validation & evaluation iteration will then take place to check the effectiveness of the improvements made.
The second testing iteration will be further enhanced via an assessment / experimentation phase, when the handling of other language pairs by the system will be investigated, leading to the final system prototype (PRESEMT Final Prototype) at the end of the project lifetime.