Efficient N-gram Language Modeling for Billion Word Web-Corpora
| Year: | 2012 |
|---|---|
| Type of Publication: | In Proceedings |
| Authors: | Lars Bungum, Björn Gambäck |
| Book title: | Proceedings of the workshop 'Challenges in the Management of Large Corpora' (CMLC) [held in conjunction with LREC2012] |
| Pages: | 6-12 |
| Address: | Istanbul, Turkey |
| Organization: | CMLC [held in conjunction with LREC2012] |
| Month: | May 22 |
Abstract: Building higher-order n-gram models over tens of gigabytes of data poses challenges in terms of speed and memory; parallelization and processing efficiency are necessary prerequisites for building the models in feasible time. The paper describes the methodology developed to carry out this task on web-induced corpora within a project aiming to develop a Hybrid MT system. Using this parallel processing methodology, a 5-gram LM with Kneser-Ney smoothing for a 3-billion-word corpus can be built in half a day. About half of that time is spent in the parallelized part of the process; executed serially, that part would have taken roughly 250 times as long, corresponding to close to two months of work.
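The record does not spell out the methodology, but the headline numbers (a speed-up factor of roughly 250 over serial execution) point to a shard-count-merge design: split the corpus into many pieces, count n-grams in each piece in parallel, then merge the counts. Below is a minimal Python sketch of that idea, assuming one-sentence-per-line text shards; the shard names, the count_ngrams helper, and the hand-off to an LM toolkit are illustrative assumptions, not the paper's actual scripts.

```python
# Minimal sketch of parallel n-gram counting via shard-count-merge.
# Shard paths and the shard count (250) are illustrative assumptions.
from collections import Counter
from multiprocessing import Pool

N = 5  # order of the language model

def count_ngrams(path):
    """Count all 1..N-grams in one corpus shard (one sentence per line)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = ["<s>"] + line.split() + ["</s>"]
            for n in range(1, N + 1):
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i : i + n])] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical shard names; one worker per shard, up to the pool size.
    shards = [f"corpus.part{i:03d}.txt" for i in range(250)]
    with Pool() as pool:
        partial = pool.map(count_ngrams, shards)
    total = Counter()
    for c in partial:  # serial merge of the per-shard counts
        total.update(c)
    # `total` now holds raw n-gram counts ready for LM estimation.
```

Note that the sketch covers only the counting stage, which the abstract identifies as roughly half of the total runtime; turning the merged counts into a Kneser-Ney smoothed 5-gram model would typically be delegated to an LM toolkit such as SRILM.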
Full text: Bungum-Gamback_CMLC2012.pdf