I am trying to create a speech recognition system for the Sinhalese language. I tried to create a language model by following the answer in "Build NEW Acoustic model, Dictionary, Language model for uncommon language speech recognition". I used both the online lmtool and cmuclmtk-0.7-win32 on Windows. My input file is as follows:

එක  eka
දෙක de ka
තුන thu na
හතර ha tha ra
පහ  pa ha
හය  ha iya
හත  ha tha
අට  ah ta
නවය na wa ya

After submitting it to lmtool and cmuclmtk, I got the following output:

AHTA    AE T AH
DEKA    D AH K AA
EKA EH K AH
HAIYA   HH EY AY AH
HATHA   HH AE TH AH
HATHARA HH AE TH AH R AH
NAWAYA  N AO EY AH
PAHA    P AE HH AH
THUNA   TH UW N AH
à¶…à¶§  
තුන   
දෙක   
නවය   
à¶´à·„  
à·„à¶­  
à·„à¶­à¶»   
හය  
එක   

Both the .dic and .lm files contain the characters above. They look like garbage characters to me. What did I do wrong to get this?

dab1984
  • The erroneous file looks vaguely like utf-8 viewed with a legacy 8-bit encoding, or possibly incorrectly recoded into utf-8 from what was erroneously specified as an 8-bit encoding. Without access to the raw bytes, we can't really tell. Check the [`character-encoding` tag wiki](http://stackoverflow.com/tags/character-encoding/info) for some background and diagnostics hints. – tripleee Jun 30 '15 at 11:34
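
The effect described in the comment above can be reproduced in a few lines. This is a minimal Python sketch, assuming the viewer decoded the file as Windows-1252 (an assumption; the question does not say which legacy encoding was involved). It encodes the word අට as UTF-8 and then misreads the bytes as cp1252, which yields the same kind of characters as in the output above. The function names are hypothetical.

```python
# Sketch: how correct UTF-8 bytes turn into "garbage" when misread.
# Assumption: the viewer used Windows-1252 (cp1252); the question does not say.

def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")

def undo_misread(garbled: str) -> str:
    """Reverse the mistake: back to the raw bytes, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")

word = "\u0d85\u0da7"  # the Sinhala word අට
garbled = misread_as_cp1252(word)
print(garbled)                        # à¶…à¶§
print(undo_misread(garbled) == word)  # True
```

Because the round trip restores the original word, the bytes themselves were never damaged in this scenario; only the interpretation was wrong. Note that a few byte values have no cp1252 mapping in Python, so `misread_as_cp1252` can raise `UnicodeDecodeError` for other words.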

1 Answer

You did everything wrong.

For corpus construction you need a text file, not a dictionary file. You create the dictionary separately.

You should not use the online lmtool for your language. It works for English only.

To train a language model from text, you should use SRILM.
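
To illustrate the difference between the two files, here is a minimal Python sketch. It assumes the input has the shape shown in the question (a word followed by space-separated pronunciation tokens). The function name `split_entry`, the example sentence, and the output file names `corpus.txt` and `sinhala.dic` are hypothetical. The corpus holds running text, one sentence per line, while the dictionary maps each word to its pronunciation; both are written explicitly as UTF-8.

```python
from pathlib import Path

def split_entry(line: str) -> tuple[str, list[str]]:
    """Split 'word tok tok ...' into (word, [pronunciation tokens])."""
    word, *tokens = line.split()
    return word, tokens

# Entries in the format used in the question (word, then pronunciation).
entries = ["එක  eka", "දෙක de ka", "තුන thu na"]
parsed = [split_entry(e) for e in entries]

# Corpus: words only, one sentence per line (a made-up example sentence).
corpus_lines = [" ".join(word for word, _ in parsed)]

# Dictionary: one word per line, followed by its pronunciation tokens.
dict_lines = [f"{word} {' '.join(tokens)}" for word, tokens in parsed]

# Write both files with an explicit encoding instead of the platform default.
Path("corpus.txt").write_text("\n".join(corpus_lines) + "\n", encoding="utf-8")
Path("sinhala.dic").write_text("\n".join(dict_lines) + "\n", encoding="utf-8")
```

Passing `encoding="utf-8"` matters on Windows, where the default text encoding is often a legacy code page rather than UTF-8.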

Nikolay Shmyrev
  • I used SRILM and a language file in text format, but I still get the same result. I followed this tutorial: http://www.cs.brandeis.edu/~cs114/CS114_docs/SRILM_Tutorial_20080512.pdf It is for Chinese. Why do I get those garbage characters? Is there a font issue on my PC? Or does SRILM not support the Sinhala language? – dab1984 Jul 03 '15 at 08:49
  • You can share your files so I can take a look. Without files it is hard to help you. – Nikolay Shmyrev Jul 03 '15 at 09:49
  • Text file I used to create the LM: http://s000.tinyupload.com/?file_id=34268100379759743452 SRILM-generated file: http://s000.tinyupload.com/?file_id=43528215708733597235 The command I used in Cygwin: ./ngram-count -text sinhala.txt -order 3 -write NPFEOT0001.count -unk My OS: Windows 8.1 64-bit – dab1984 Jul 03 '15 at 10:26
  • The file looks correct; I am not sure why you think the characters are garbled. You need to use a good editor that supports UTF-8 to view the files, for example Notepad++. – Nikolay Shmyrev Jul 03 '15 at 12:44
  • Awesome, Notepad++ did the trick. I am using Notepad++ as my default editor from now on. – dab1984 Jul 06 '15 at 03:06
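
As a way to check a file without relying on an editor, the raw bytes can be inspected directly. The following is a minimal Python sketch, not part of the original exchange. The function name `check_utf8_sinhala` is hypothetical, and the file name `NPFEOT0001.count` is taken from the command above. It reports whether the bytes decode as UTF-8 and how many characters fall in the Unicode Sinhala block (U+0D80 to U+0DFF).

```python
from pathlib import Path

def check_utf8_sinhala(data: bytes) -> tuple[bool, int]:
    """Return (is_valid_utf8, number_of_sinhala_characters)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False, 0
    # Count code points inside the Sinhala block U+0D80..U+0DFF.
    count = sum(1 for ch in text if "\u0d80" <= ch <= "\u0dff")
    return True, count

# Only inspect the file if it is present in the current directory.
path = Path("NPFEOT0001.count")
if path.exists():
    print(check_utf8_sinhala(path.read_bytes()))
```

A result with a positive count suggests the file is fine and only the viewer is misreading it. A valid UTF-8 file with a count of zero, where Sinhala text is expected, would instead point to text that was re-encoded incorrectly somewhere along the way.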