Corpus processing tools

presemt-tl-phrase-model

Pre-processing scripts for the creation of the monolingual corpus phrase model for the PRESEMT translation system.


presemt-phrase-aligner-module

The Phrase Aligner Module processes bilingual corpora, aligning the texts of a language pair at the word and phrase level.


presemt-phrase-model-generator

The Phrase Model Generator (PMG) supports two distinct operations. The first processes the output of the Phrase Aligner Module to train a phrasing model for the source language (SL) of the specified language pair. The second uses the trained phrasing model to parse any SL input text and split it into phrases in preparation for translation.


chart-parser

The chart-parser is an implementation of Earley's chart parsing algorithm.
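
For readers unfamiliar with the algorithm, the following is a minimal sketch of an Earley recognizer. It is not taken from the chart-parser package: the names Item, earley_recognize and GRAMMAR, the toy grammar, and the use of part-of-speech tags as tokens are all assumptions made for illustration. The sketch only decides whether an input is accepted, builds no parse tree, and assumes a grammar without empty productions.

```python
from collections import namedtuple

# An Earley item: a dotted rule plus the input position where it started.
# (Hypothetical name; not part of the chart-parser package.)
Item = namedtuple("Item", "lhs rhs dot origin")


def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Any symbol that is not a key of `grammar` is treated as a terminal.
    Assumption: the grammar contains no empty (epsilon) productions.
    """
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]

    def add(position, item):
        # Each chart cell behaves like an ordered set of items.
        if item not in chart[position]:
            chart[position].append(item)

    for rhs in grammar[start]:
        add(0, Item(start, rhs, 0, 0))

    for i in range(n + 1):
        j = 0
        while j < len(chart[i]):  # the cell may grow while we iterate
            item = chart[i][j]
            j += 1
            if item.dot < len(item.rhs):
                symbol = item.rhs[item.dot]
                if symbol in grammar:
                    # Predictor: expand the expected nonterminal at position i.
                    for rhs in grammar[symbol]:
                        add(i, Item(symbol, rhs, 0, i))
                elif i < n and tokens[i] == symbol:
                    # Scanner: the next token matches the expected terminal.
                    add(i + 1, item._replace(dot=item.dot + 1))
            else:
                # Completer: advance every item that was waiting for item.lhs.
                for parent in chart[item.origin]:
                    if (parent.dot < len(parent.rhs)
                            and parent.rhs[parent.dot] == item.lhs):
                        add(i, parent._replace(dot=parent.dot + 1))

    return any(
        item.lhs == start and item.dot == len(item.rhs) and item.origin == 0
        for item in chart[n]
    )


# Toy grammar over part-of-speech tags (illustrative only).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("det", "noun"), ("noun",)],
    "VP": [("verb", "NP"), ("verb",)],
}

print(earley_recognize(GRAMMAR, "S", ["det", "noun", "verb", "noun"]))
print(earley_recognize(GRAMMAR, "S", ["det", "verb"]))
```

The first call accepts the tag sequence and the second rejects it, because no rule lets a determiner be followed directly by a verb.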


jusText: boilerplate removal tool

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text that contains full sentences, which makes it well suited for creating linguistic resources such as Web corpora.


onion: duplicate content removal tool

Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.
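
One common way to detect duplicate parts is to compare the word n-grams of each paragraph with those already seen. The sketch below illustrates that idea only; it is not onion's implementation or interface. The function names, the n-gram length of 3 and the 0.5 threshold are assumptions chosen for the example.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def remove_duplicates(paragraphs, n=3, threshold=0.5):
    """Keep only paragraphs whose n-grams are mostly unseen so far.

    A paragraph is dropped when more than `threshold` of its n-grams
    already occurred in paragraphs that were kept earlier.
    (Hypothetical helper; `n` and `threshold` are illustrative defaults.)
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        if grams:
            overlap = len(grams & seen) / len(grams)
            if overlap > threshold:
                continue  # mostly duplicated content, so skip it
        kept.append(paragraph)
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about web corpora",
    "the quick brown fox jumps over the lazy dog",
]
print(remove_duplicates(paragraphs))
```

Only the first two paragraphs survive, since every n-gram of the third one was already seen. Paragraphs shorter than n words have no n-grams and are always kept in this sketch.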


chared: character encoding guesser

chared is a tool for detecting the character encoding of a text in a known language. The package contains models for a wide range of languages.
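
Knowing the language in advance makes the problem tractable: each candidate encoding can be tried, and the one that yields the most plausible text for that language wins. The sketch below shows this with a deliberately crude substitute for a trained model, namely the fraction of decoded characters that belong to the language's alphabet. It does not reproduce chared's models or its interface; guess_encoding, alphabet_score, CZECH_ALPHABET and the candidate list are assumptions for illustration.

```python
# Lowercase Czech letters plus the space character (illustrative "model").
CZECH_ALPHABET = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž ")


def alphabet_score(text, alphabet):
    """Fraction of characters in `text` that belong to `alphabet`."""
    if not text:
        return 0.0
    return sum(ch in alphabet for ch in text.lower()) / len(text)


def guess_encoding(data, candidates, alphabet):
    """Return the candidate encoding whose decoding best fits `alphabet`.

    Encodings that cannot decode `data` at all are skipped.
    Returns None if every candidate fails.
    """
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        score = alphabet_score(text, alphabet)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


CANDIDATES = ["utf-8", "iso8859_2", "cp1250"]
sample = "příliš žluťoučký kůň"

print(guess_encoding(sample.encode("cp1250"), CANDIDATES, CZECH_ALPHABET))
print(guess_encoding(sample.encode("utf-8"), CANDIDATES, CZECH_ALPHABET))
```

A real detector needs a far richer language model than an alphabet check, because closely related encodings often differ in only a handful of byte values.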