Conditional Random Fields versus template-matching in MT phrasing tasks involving sparse training data

Year: 2015
Type of Publication: Article
Keywords: Parsing of natural language; Template-matching; Conditional-random fields; Phrasing model generator; Machine translation
Journal: Pattern Recognition Letters
Volume: 53
Pages: 44-52
This communication compares the template-matching technique with established probabilistic approaches, such as conditional random fields (CRF), on a specific linguistic task, namely the phrasing of a sequence of words into phrases. This task represents a low-level parsing of the sequence into linguistically motivated phrases. CRF is the established method for implementing such a data-driven parser, while template-matching is a simpler method that is faster to train and operate. The two techniques are compared here to determine which is more suitable for extracting an accurate phrasing model. The specific application studied is a machine translation (MT) methodology, namely PRESEMT, though the comparison also holds for other applications in which only sparse training data are available. PRESEMT uses small parallel corpora to learn structural transformations from a source language (SL) to a target language (TL) and thus translate input text. As a result, only sparse training data are available for training the parser. Experimental results indicate that for a limited-size training set, as is the case for the PRESEMT methodology, template-matching generates a superior phrasing model that in turn yields higher-quality translations. This is confirmed for more than one source/target language pair and on multiple independent test sets.