I would like to build a language model for CMU Sphinx, but my corpus has more than 1000 words, so I cannot use the online tool. How do I use the scripts in cmuclmtk to build my language model?
2 Answers
Please read the tutorial
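
For reference, the core cmuclmtk pipeline described in that tutorial looks roughly like this (a sketch only: `corpus.txt` is a placeholder name, the corpus is assumed to have one sentence per line, and exact option names may differ between cmuclmtk versions):

```shell
# Build a vocabulary file from word frequencies in the corpus
text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab

# Convert the corpus into id n-grams using that vocabulary
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt

# Estimate the n-gram language model and write it in ARPA format
idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa
```

The resulting `corpus.arpa` file can then be converted to the binary format your decoder expects.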

- That document was very helpful with the exception of 'Generating a dictionary'. Does the distribution come with a script to generate that dictionary? – joeforker Jan 24 '11 at 19:25
- You can use the pronounce tool, which you can check out from Subversion: http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios/Tools/MakeDict/ There are also external g2p packages, such as http://code.google.com/p/phonetisaurus/ or sequitur-g2p, which can be used as well. – Nikolay Shmyrev Jan 24 '11 at 21:28
- It appears pocketsphinx has a dictionary in the en_US directory, right next to the models. I'm going to try using that one. – joeforker Jan 26 '11 at 21:40
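
For reference, such a dictionary is a plain text file mapping each word to its phone sequence, one entry per line (the words below are illustrative examples; alternate pronunciations are marked with a parenthesized index):

```
hello HH AH L OW
world W ER L D
the DH AH
the(2) DH IY
```

A custom `.dic` file for a small corpus follows the same format.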
- Hi Nikolay, I currently have a large text file containing around 11k words. Can you please tell me the exact commands to generate .lm and .dic/.DMP files from that text file? Thanks in advance. – ravoorinandan Sep 08 '11 at 07:51
- Hello ravoorinandan. You can find the exact commands (text2wfreq, text2idngram, idngram2lm) in the tutorial above. – Nikolay Shmyrev Sep 14 '11 at 09:42
- Thanks a lot for your reply, Nikolay. Using the available documentation, I have created .binlm and .arpa files from the corpus text file, but I don't know how to use them in my application. What key do we need to provide when giving the ARPA format as input, as opposed to .lm or .DMP? – ravoorinandan Sep 17 '11 at 09:34
- And can you please let me know how to create a dictionary? Thanks a lot for your help. – ravoorinandan Sep 17 '11 at 09:35
- I checked out the link you provided above (cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios/), but I am not able to create the file with the .dic extension from the Corpus.txt file. – ravoorinandan Sep 19 '11 at 10:55
- Logios is not the only tool; you can find references to others in the tutorial. When you run into trouble, it's always recommended to provide more details. Since you don't describe where you fail, it's hard to suggest anything. – Nikolay Shmyrev Sep 20 '11 at 00:44
- Thanks a lot for your fast response, Nikolay. I will gather the error details and let you know. Meanwhile, I have one more doubt: can we convert a .DMP file into .lm? Right now I am using sphinx_lm_convert to convert ARPA into DMP. – ravoorinandan Sep 20 '11 at 06:35
- And I am creating a dictionary file (.dict format) by uploading the corpus text file at the following link: http://www.speech.cs.cmu.edu/tools/lextool.html It provides me with .dict and .word files; can I use them in my application? – ravoorinandan Sep 20 '11 at 06:38
- Hello. You can convert from DMP to ARPA with sphinx_lm_convert too. The lextool web service is essentially logios installed as a web service; you can check it out and install it on your own machine. You can try other packages too. – Nikolay Shmyrev Sep 20 '11 at 19:33
- Yes, Nikolay, I am aware of that, but does this DMP file work with the Sphinx-II decoder, i.e. in VocalKit? Nothing is happening when I use the .dic and .DMP files in my voice search (it returns a null value). – ravoorinandan Sep 21 '11 at 11:54
- DMP should work with VocalKit. If you have specific issues, you could debug them. – Nikolay Shmyrev Sep 21 '11 at 18:45
This is not a trivial task: generating a language model is time- and resource-intensive.
If you want a "good" language model, you will need a large or very large text corpus to train it (think on the order of magnitude of several years of Wall Street Journal text).
"Good" here means that the language model can generalize from the training data to new, previously unseen input.
You should look at the documentation for the Sphinx and HTK language model toolkits.
http://cmusphinx.sourceforge.net/wiki/tutoriallm
Also check this thread:
Building openears compatible language model
You could take a more general language model based on a bigger corpus and interpolate your smaller language model with it (e.g. as a back-off language model), but that's not a trivial task.
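
Linear interpolation combines the two models as P(w|h) = λ·P_general(w|h) + (1−λ)·P_domain(w|h), where λ weights the general model against your domain model. As one possible sketch, the SRILM toolkit (not mentioned above, but commonly used for this) can interpolate two ARPA models; file names here are placeholders and λ = 0.7 is an arbitrary example weight:

```shell
# Interpolate a large general LM with a small domain LM;
# -lambda is the weight given to the first (-lm) model
ngram -lm general.arpa -mix-lm domain.arpa -lambda 0.7 -write-lm mixed.arpa
```

In practice λ is usually tuned by minimizing perplexity on a held-out set from your domain.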