Use CMU's sphinx4 to transcribe non-digits data

Question

I have recently started using CMU's sphinx4 for transcription and, eventually, forced alignment, i.e. aligning audio with its transcript.

I found a project called AutoCap that does basically what I want to develop. I installed it, but it did not work. I tried tweaking it, but all I got was incorrect timestamps.

So I decided to give sphinx4 a go myself. I successfully transcribed a wav file using Sphinx's Transcriber.jar file, but I could not get it working for audio with non-digits data. The readme page states: 'people who want to transcribe non-digits data should modify the config.xml file to use the correct grammar, language model, and linguist to do so'.

So, can anyone help me with any of these:

AutoCap
Using Sphinx4 to transcribe non-digits data
Forced Alignment

Thanks.

Did you have any more success with this project? I would appreciate any input. — Pit Digger, Jan 11 '13 at 21:04

score 2 · Answer 1 · answered Aug 13 '11 at 14:37

There is a specific project dedicated to speech-to-text alignment. This is not a trivial task. Development is happening in a separate sphinx4 branch. You can find some details here:

http://cmusphinx.sourceforge.net/?s=long+audio+alignment

If you have any questions about this project, you are welcome to ask on the sphinx4 forum:

http://sourceforge.net/projects/cmusphinx/forums/forum/382337

score 0 · Answer 2 · answered Sep 03 '11 at 09:16

I am currently working on the same issue, i.e. transcribing non-digits data. I have looked briefly at the sphinx4 programmer's guide and used the language models, acoustic models, and JSGF grammar as suggested, but the results were not up to the mark. I believe that merely tweaking parameters or changing config.xml will not suffice; I think we would need a home-grown algorithm alongside sphinx4 to get better speech recognition. For my part, I have used the LexTreeLinguist, JSGFGrammar, and the trigram language model, but the results were not great, perhaps because the audio input was not exactly American English. I will work on it a bit more and let you know my results.
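
For anyone unsure what "changing config.xml" amounts to in practice, here is a minimal sketch of switching which linguist a configuration points at. It only uses the JDK's XML classes, not sphinx4 itself. The sample XML is a hypothetical, heavily trimmed fragment: the component name searchManager, the type example.SearchManager, and the values flatLinguist and lexTreeLinguist are assumptions for illustration, so check them against the config.xml shipped with your own demo before relying on them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConfigLinguistSwitch {

    // Hypothetical, trimmed config in a component/property style.
    // All names and values here are assumptions, not copied from a real config.xml.
    static final String SAMPLE =
          "<config>"
        + "<component name=\"searchManager\" type=\"example.SearchManager\">"
        + "<property name=\"linguist\" value=\"flatLinguist\"/>"
        + "</component>"
        + "</config>";

    // Returns the XML with every <property name="linguist"> pointed at newLinguist.
    static String switchLinguist(String xml, String newLinguist) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            if ("linguist".equals(p.getAttribute("name"))) {
                p.setAttribute("value", newLinguist);
            }
        }

        // Serialize the modified DOM back to a string without the XML declaration.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(switchLinguist(SAMPLE, "lexTreeLinguist"));
    }
}
```

In a real configuration the new linguist value would also have to name a component that is actually defined elsewhere in the file, together with its language model and dictionary; this sketch only shows the reference being swapped.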
