PRESEMT

Efficient N-gram Language Modeling for Billion Word Web-Corpora

Research areas:	Corpus Modelling
Year:	2012
Type of Publication:	In Proceedings
Authors:	39, 31

Book title:	Proceedings of the workshop 'Challenges in the Management of Large Corpora' (CMLC) [held in conjunction with LREC2012]
Pages:	6-12
Address:	Istanbul, Turkey
Organization:	CMLC [held in conjunction with LREC2012]
Month:	May 22


Abstract:	Building higher-order n-gram models over tens of gigabytes of data poses challenges in terms of speed and memory; parallelization and processing efficiency are prerequisites for building the models in feasible time. The paper describes the methodology developed to carry out this task on web-induced corpora within a project aiming to develop a Hybrid MT system. Using this parallel processing methodology, a 5-gram LM with Kneser-Ney smoothing for a 3-billion-word corpus can be built in half a day. About half of that time is spent in the parallelized part of the process. For a serial execution of the script, this time would have to be multiplied by 250, corresponding to close to two months of work.
Full text:	Bungum-Gamback_CMLC2012.pdf
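
The timing figures in the abstract can be checked with a short calculation. The sketch below is only a back-of-the-envelope reading of those figures: it assumes that "half a day" means 12 wall-clock hours and that "about half" is exactly half. Neither assumption is stated in the record, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the timing figures quoted in the abstract.
# Assumptions (not stated in the source): "half a day" = 12 wall-clock hours,
# and "about half of that time" = exactly half.
HOURS_PER_DAY = 24

total_build_hours = 0.5 * HOURS_PER_DAY       # whole build: "half a day"
parallel_part_hours = total_build_hours / 2   # time spent in the parallelized part
serial_factor = 250                           # multiplier quoted in the abstract

# Time the parallelized part would take if it were run serially instead.
serial_hours = parallel_part_hours * serial_factor
serial_days = serial_hours / HOURS_PER_DAY

print(f"parallelized part: {parallel_part_hours:.0f} h")
print(f"serial equivalent: {serial_hours:.0f} h = {serial_days:.1f} days")
```

Under these assumptions the serial equivalent comes to roughly 62 days, which is consistent with the abstract's "close to two months".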


The research leading to these results has received funding from the European Community's
Seventh Framework Programme (FP7/2007-2013) under grant agreement No 248307.