Start date: 1.2.2016

Project duration: 18 months (with a possible extension of 6 months)

Description of work

The methodology developed within PRESEMT has focussed on building Machine Translation systems while minimising the requirements for specialised linguistic resources. Consequently, PRESEMT relies mainly on large monolingual corpora to model the target language, together with a very small parallel corpus (of the order of 200 sentences). PRESEMT is phrase-based, with syntactically motivated phrases, and implements the translation process as a two-phase sequence, as depicted in the following figure.



The aim of the current project is to improve translation performance for relatively mature language pairs, with emphasis on pairs involving Greek. This improvement will be achieved by extracting more accurate models from raw data and integrating them into the translation process. Based on the results of the PRESEMT and POLYTROPON projects, the main priority is to optimise the extraction, from the small parallel corpus, of the information that determines how the sentence structure must be modified when moving from the source to the target language.

To this end, it is useful to determine the degree of parallelism of each sentence pair and then dynamically select the optimal subset of parallel sentences for the translation system. Based on this selected corpus, computational intelligence techniques will be used to extract the appropriate model for the transition from the source to the target language. Metaheuristic methods will be used for the systematic optimisation of the system parameters.
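As an illustration only, the selection of a subset of parallel sentences could be sketched as follows. The scoring function, the small bilingual lexicon and all names here are hypothetical assumptions, not the PRESEMT implementation: each pair is scored by the fraction of source words that have a lexicon translation in the target sentence, scaled by the ratio of the sentence lengths, and the k highest-scoring pairs are kept.

```python
from typing import Dict, List, Set, Tuple

SentencePair = Tuple[str, str]


def parallelism_score(source: str, target: str,
                      lexicon: Dict[str, Set[str]]) -> float:
    """Hypothetical parallelism score in [0, 1] for one sentence pair."""
    src_tokens = source.lower().split()
    tgt_tokens = target.lower().split()
    if not src_tokens or not tgt_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    # Fraction of source tokens with at least one lexicon translation
    # that appears in the target sentence.
    covered = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    coverage = covered / len(src_tokens)
    # Ratio of the shorter to the longer sentence length, so that pairs
    # of very different lengths are penalised.
    length_ratio = (min(len(src_tokens), len(tgt_tokens))
                    / max(len(src_tokens), len(tgt_tokens)))
    return coverage * length_ratio


def select_parallel_subset(pairs: List[SentencePair],
                           lexicon: Dict[str, Set[str]],
                           k: int) -> List[SentencePair]:
    """Keep the k pairs with the highest (assumed) parallelism score."""
    ranked = sorted(pairs,
                    key=lambda p: parallelism_score(p[0], p[1], lexicon),
                    reverse=True)
    return ranked[:k]
```

In practice the score would more plausibly come from alignment quality or a trained model; the point of the sketch is only the two-step shape of the procedure, scoring each pair and then selecting a subset.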

In a related research direction, changes to the Phrase Aligner Module will be investigated to improve its performance on particularly difficult cases. In addition, alternatives to CRF (Conditional Random Fields, the default choice in PRESEMT) will be investigated for determining phrases (PMG module).

For establishing language models from monolingual corpora, the existing approach based on indexed files is considered effective. However, experiments within the POLYTROPON project demonstrated that the final translation can be improved by employing n-gram models. For this reason, a second line of research is proposed, involving the optimal extraction of information from monolingual corpora using statistical methods (n-grams) via software packages such as SRI. This research is expected to improve the accuracy of the second translation phase of PRESEMT.
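To make the n-gram idea concrete, the following is a minimal sketch of a bigram language model with add-one smoothing, used to rank candidate target-language word sequences. It is an assumption-laden toy, not the interface of any existing package and not the PRESEMT second phase itself; the class and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class BigramModel:
    """Toy bigram model with add-one (Laplace) smoothing."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for tokens in sentences:
            # Pad with sentence-boundary markers before counting.
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        # Vocabulary size, counting the two boundary markers as well.
        self.vocab_size = len(self.unigrams)

    def log_prob(self, tokens: List[str]) -> float:
        """Smoothed log-probability of a tokenised sentence."""
        padded = ["<s>"] + tokens + ["</s>"]
        total = 0.0
        for prev, word in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def pick_best(model: BigramModel, candidates: List[List[str]]) -> List[str]:
    """Return the candidate word order the model finds most probable."""
    return max(candidates, key=model.log_prob)
```

A model trained on a monolingual target-language corpus in this way would prefer word sequences it has seen frequently, which is the kind of evidence a dedicated toolkit provides at scale, with higher n-gram orders and better smoothing.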