PRESEMT

Building a 70 billion word corpus of English from ClueWeb

Research areas:	Corpus Linguistics	Year:	2012
Type of Publication:	In Proceedings	Keywords:	corpus, clueweb, English, encoding, word sketch
Authors:	37, 45 42

Book title:	Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC2012)
Pages:	502-506
Address:	Istanbul, Turkey
Organization:	LREC2012	Month:	May 23-25


Abstract:	This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying using CQL, the Corpus Query Language). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), the latter performed not only on full (document-level) duplicates but also on near-duplicate texts. We also show the impact of each pre-processing step on the final corpus size. Finally, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
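The near-duplicate removal mentioned in the abstract can be illustrated with a minimal sketch. This is not onion's actual implementation: it assumes a simple scheme in which a paragraph is dropped when most of its word n-grams have already been seen in earlier text. The function names, the n-gram length of 3, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def shingles(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of word n-grams (shingles) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs: Iterable[str], n: int = 3,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph unless more than `threshold` of its n-grams
    were already seen in previously kept paragraphs.

    Illustrative only: n and threshold are assumed values, and
    tokenization is naive lowercase whitespace splitting.
    """
    seen: Set[NGram] = set()
    kept: List[str] = []
    for text in paragraphs:
        grams = shingles(text.lower().split(), n)
        if grams:
            duplicated = len(grams & seen) / len(grams)
            if duplicated > threshold:
                continue  # full or near duplicate: skip it
        # Paragraphs too short to form an n-gram are always kept.
        seen |= grams
        kept.append(text)
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "the quick brown fox jumps over the lazy cat",   # near duplicate
    "an entirely different sentence about corpus indexing",
]
result = deduplicate(paragraphs)
```

Because the comparison is made on n-gram overlap rather than on whole-text equality, the same pass catches both exact copies and texts that differ in only a few words. At the scale described in the abstract, the set of seen n-grams would presumably need a far more compact representation (for example, hashes) than the in-memory tuples used here.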
Full text:	PomikalekEtAl_LREC2012.pdf

Web design, realisation, maintenance and administration by Marina Vassiliou
Logo design and realisation by Zacharias Detorakis
The research leading to these results has received funding from the European Community's
Seventh Framework Programme (FP7/2007-2013) under grant agreement No 248307.
