Efficient N-gram Language Modeling for Billion Word Web-Corpora

Year: 2012
Type of Publication: In Proceedings
Book title: Proceedings of the workshop 'Challenges in the Management of Large Corpora' (CMLC) [held in conjunction with LREC2012]
Pages: 6-12
Address: Istanbul, Turkey
Organization: CMLC [held in conjunction with LREC2012]
Month: May 22
Building higher-order n-gram models over tens of gigabytes of data poses challenges in terms of speed and memory; parallelization and processing efficiency are prerequisites for building the models in feasible time. The paper describes the methodology developed to carry out this task on web-induced corpora within a project aiming to develop a hybrid MT system. Using this parallel processing methodology, a 5-gram LM with Kneser-Ney smoothing for a 3-billion-word corpus can be built in half a day. About half of that time is spent in the parallelized part of the process. With serial execution of the script, that part would have taken 250 times as long (corresponding to close to two months of work).
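The abstract does not spell out the parallel pipeline, so the following is only a minimal sketch of the general idea it points to: split the corpus into shards, count n-grams in each shard independently, then merge the partial counts. The function names (`ngrams`, `count_shard`, `count_ngrams`) and the `map_fn` parameter are hypothetical and not taken from the paper, and Kneser-Ney smoothing is left out entirely.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield all n-grams of order n from a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_shard(job):
    """Count all n-grams of order 1..max_order in one shard of lines."""
    lines, max_order = job
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for order in range(1, max_order + 1):
            counts.update(ngrams(tokens, order))
    return counts


def count_ngrams(lines, max_order=5, shards=4, map_fn=map):
    """Shard the corpus, count each shard via map_fn, merge the results.

    map_fn defaults to the built-in serial map; any function with the
    same calling convention can be substituted (an assumption of this
    sketch, not something the paper prescribes).
    """
    jobs = [(lines[i::shards], max_order) for i in range(shards)]
    total = Counter()
    for partial in map_fn(count_shard, jobs):
        total.update(partial)  # Counter.update adds counts together
    return total


corpus = ["a b a b", "a b"]
counts = count_ngrams(corpus, max_order=2, shards=2)
print(counts[("a", "b")])
```

Because each shard is counted independently, the counting step can be distributed by passing a parallel map, for example the `map` method of a `concurrent.futures.ProcessPoolExecutor` created under an `if __name__ == "__main__":` guard. Merging stays serial here, which loosely mirrors the abstract's remark that only part of the total build time lies in the parallelized stage.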
Full text: Bungum-Gamback_CMLC2012.pdf