I'm looking for speech recognition software for Java that works more like the Android version: instead of requiring .gram files and the like, it just returns a string of what was said, and I can act on it. I've tried using sphinx-4, but having to write .gram files makes my program a lot harder to build.
-
The point of a grammar file is to improve the accuracy of what you're getting back. Instead of trying to come up with random strings of English words, you tell it to expect specific input. That said, sphinx-4 can do simple large-dictionary ASR as well. What are you having trouble with? – Aleksandr Dubinsky Dec 21 '12 at 21:33
-
I'm working on a "Siri"-type thing, so maintaining a large .gram file will get annoying. I also look for words inside the string: for example, if they say the word "weather", I assume they are asking for the weather. But they could say this in lots of ways: "Whats the weather", "Is the weather nice", "whats the weather going to be like tomorrow", and so on. Android returns a string of what was said, nice and easy. With .gram files I would have to add each possible response, which would make the program less useful, since the user might phrase it differently. – JPatrickDev Dec 21 '12 at 21:39
1 Answer
The point of a grammar file is to improve the accuracy of what you're getting back. Instead of trying to come up with random strings of English words, you tell it to expect specific input.
That said, sphinx-4 can do ordinary large-dictionary ASR as well. Read the N-Gram part of this tutorial and look at the Transcriber sample that comes with the sphinx source code.
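Once the recognizer hands back a plain string, the keyword check described in the question could look roughly like the sketch below. It is independent of sphinx-4 and only assumes the transcript arrives as a `String`; the class name `IntentMatcher`, the method `detectIntent`, and the keyword table are hypothetical names made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a recognized transcript to a coarse intent label.
class IntentMatcher {
    // Illustrative keyword -> intent table (an assumption, not part of any library).
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("weather", "WEATHER");
        KEYWORDS.put("time", "TIME");
    }

    /** Returns the intent of the first table keyword found as a whole word, or "UNKNOWN". */
    static String detectIntent(String transcript) {
        // Lower-case, then split on anything that is not a letter or an apostrophe.
        String[] words = transcript.toLowerCase(Locale.ROOT).split("[^a-z']+");
        for (Map.Entry<String, String> entry : KEYWORDS.entrySet()) {
            for (String word : words) {
                if (word.equals(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(detectIntent("whats the weather going to be like tomorrow"));
    }
}
```

Matching whole words rather than substrings avoids false hits such as "time" inside "sometimes".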
In addition, you can train your own trigram model to improve the results you get (e.g., to place more probability on the word "weather" being detected). This is almost certainly what Siri does. Apple and Google have huge corpora of audio that people have spoken into their phones, part of which is human-transcribed, and from which they train both acoustic and language models (so their engines detect things people typically say instead of nonsense).
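To make "trigram model" concrete, here is a toy sketch of the underlying idea: count how often each word follows each two-word history in some training text, then estimate P(word | previous two words) from those counts. This is not the sphinx-4 API or its model file format; the class `TrigramCounter` and its methods are hypothetical, and real toolkits add smoothing for unseen word sequences, which this sketch leaves out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy trigram model: unsmoothed maximum-likelihood estimates only.
class TrigramCounter {
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> historyCounts = new HashMap<>();

    /** Adds one training sentence, padded with start and end markers. */
    void addSentence(String sentence) {
        String padded = "<s> <s> " + sentence.toLowerCase(Locale.ROOT).trim() + " </s>";
        String[] w = padded.split("\\s+");
        for (int i = 2; i < w.length; i++) {
            String history = w[i - 2] + " " + w[i - 1];
            // Count the two-word history and the full three-word sequence.
            historyCounts.merge(history, 1, Integer::sum);
            trigramCounts.merge(history + " " + w[i], 1, Integer::sum);
        }
    }

    /** Estimates P(word | w1 w2) as count(w1 w2 word) / count(w1 w2); 0 for an unseen history. */
    double probability(String w1, String w2, String word) {
        String history = w1 + " " + w2;
        int historyCount = historyCounts.getOrDefault(history, 0);
        if (historyCount == 0) {
            return 0.0;
        }
        return trigramCounts.getOrDefault(history + " " + word, 0) / (double) historyCount;
    }

    public static void main(String[] args) {
        TrigramCounter model = new TrigramCounter();
        model.addSentence("whats the weather");
        model.addSentence("whats the weather like tomorrow");
        model.addSentence("whats the time");
        System.out.println(model.probability("whats", "the", "weather"));
    }
}
```

Training on text that resembles what users actually say shifts probability toward words like "weather" after "whats the", which is the effect described above.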

-
Great, thanks for the help. I'll look into the N-Gram part and the trigram models :) – JPatrickDev Dec 21 '12 at 21:45
-
Take a look at this other answer: http://stackoverflow.com/questions/8727389/dictation-application-using-sphinx4. It links to a page that has some language models. I prefer using MITLM for generating language models, but for now you don't have to worry about that. Training an acoustic model will be more important, since it lets you tailor the model to the recording conditions (different microphones, background noise if someone uses it outdoors, etc.). You can also incorporate speaker adaptation for each of your users. See here: http://cmusphinx.sourceforge.net/wiki/tutorialadapt – Aleksandr Dubinsky Dec 22 '12 at 07:05