I would like to know how many documents or sentences or words I need to process in order to get a good language model of a domain and use it in voice recognition tools such as CMU Sphinx.

Charles
pjvv1

2 Answers

To create a decent language model for a small domain, it's usually enough to have about 100 MB of text. You can mix the resulting model with a generic language model so that it generalizes better.
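The mixing step can be sketched as a linear interpolation of two models. This is only a toy illustration on unigram probabilities: the function names, the example sentences, and the weight `lam=0.7` are assumptions made up for the example, not something taken from Sphinx or any other toolkit.

```python
from collections import Counter


def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(domain, generic, lam=0.7):
    """Linear interpolation: lam * P_domain(w) + (1 - lam) * P_generic(w)."""
    vocab = set(domain) | set(generic)
    return {
        w: lam * domain.get(w, 0.0) + (1 - lam) * generic.get(w, 0.0)
        for w in vocab
    }


# Hypothetical toy corpora; a real setup would use far more text.
domain_tokens = "turn the volume up turn the volume down".split()
generic_tokens = "the cat sat on the mat and looked up".split()

mixed = interpolate(
    unigram_probs(domain_tokens),
    unigram_probs(generic_tokens),
    lam=0.7,  # assumed weight; normally tuned on held-out domain text
)
```

Words that never occur in the domain text (such as "cat" here) still get a non-zero probability from the generic model, which is the point of mixing.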

To create a generic language model, developers use very large corpora. For example, the Google Web 1T 5-gram corpus was built from about a trillion words of web text. Its trigram part alone is about 40 GB of counts, and the underlying text must have run to many terabytes.

Nikolay Shmyrev
  • Where can I download this 1TB corpus? – A T Mar 03 '12 at 16:45
  • Google data is available to buy from LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 See also http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html – Nikolay Shmyrev Mar 03 '12 at 17:03
Adding to Nikolay's answer:

This is not a trivial task: generating a language model is time- and resource-intensive.

If you want a "good" language model, you will need a large or very large text corpus to train it (think on the order of several years of Wall Street Journal text).

"Good" here means that the language model is able to generalize from the training data to new, previously unseen input.
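One common way to put a number on this kind of generalization is perplexity on held-out text (lower is better). The sketch below is a toy example, assuming an add-one smoothed unigram model with an `<unk>` token; the function names and sentences are made up for illustration, and real toolkits use higher-order n-grams and better smoothing.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Add-one smoothed unigram model over the training vocabulary plus <unk>."""
    vocab = set(tokens)
    counts = Counter(tokens)
    total = len(tokens) + len(vocab) + 1  # +1 reserves mass for <unk>

    def prob(word):
        key = word if word in vocab else "<unk>"
        return (counts[key] + 1) / total

    return prob


def perplexity(prob, tokens):
    """Perplexity = 2 ** (-average log2 probability per token)."""
    log_sum = sum(math.log2(prob(t)) for t in tokens)
    return 2 ** (-log_sum / len(tokens))


# Hypothetical data: a tiny training set and two held-out sentences.
prob = train_unigram("turn the volume up turn the volume down".split())
ppl_in_domain = perplexity(prob, "turn the volume down".split())
ppl_out_of_domain = perplexity(prob, "play the next song".split())
```

If perplexity on held-out text from your domain stops improving as you add training data, the corpus is probably large enough for that domain.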

You should look at the documentation of the Sphinx and the HTK language model toolkits.

Please check these two threads:

Building openears compatible language model

Ruby Text Analysis

You could take a more general language model, trained on a bigger corpus, and combine your smaller language model with it, e.g. by interpolation or with a back-off language model, but that is not a trivial task either.

see: http://en.wikipedia.org/wiki/Katz's_back-off_model
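To give a rough feel for the back-off idea, here is a minimal sketch. Note that this is not Katz's model: it leaves out the discounting step and uses a fixed penalty instead (a scheme sometimes called "stupid backoff"), so the results are scores rather than properly normalized probabilities. The function names and the value `alpha=0.4` are assumptions for the example.

```python
from collections import Counter


def train(tokens):
    """Collect unigram and bigram counts from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_score(prev, word, unigrams, bigrams, alpha=0.4):
    """Use the bigram estimate if the bigram was seen, else back off to unigrams."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total  # fixed penalty, no discounting


# Hypothetical toy corpus.
unigrams, bigrams = train("turn the volume up turn the volume down".split())
seen = backoff_score("the", "volume", unigrams, bigrams)    # bigram was observed
unseen = backoff_score("volume", "the", unigrams, bigrams)  # falls back to unigram
```

Katz's version additionally discounts the seen n-gram counts and redistributes that mass to the backed-off estimates, which is what makes it a true probability distribution.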

Community
Tilo