How to set up a Speech Recognition Server?

Question

How can I implement speech recognition on the server side (please don't suggest HTML5's x-webkit-speech, JavaScript, etc.)? The program will take an audio file as input and provide a text transcription of it with sufficient accuracy. What options can I use?

I have tried implementing Sphinx4 with the Voxforge model, but the accuracy is very poor (there may also be some problem in my configuration; I am still trying to learn it). In one post I read that when we use <input name="speech" id="speech" type="text" x-webkit-speech /> the input is sent to an external server, and that server then does the recognition and sends the data back to the browser.

How can I set up that server? Any existing open source server would also be useful if it can recognize English sentences with a minimal error rate.

score 3 · Answer 1 · edited Apr 23 '13 at 14:26

What type of application are you implementing? Is its purpose to transcribe the user's spoken input into text, or is it meant to just understand simple commands? Systems like Sphinx4 use a statistical model for transcription of speech. You will not get recognition as good with these types of systems as you would with an automated speech recognition (ASR) system that uses grammars to restrict the search space. Systems that use statistical models require a lot of tuning and trial runs to get decent recognition.
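
To make the idea of a grammar restricting the search space concrete, here is a toy sketch in Python. It is not how Sphinx4 or any commercial engine works internally: a real grammar-based recognizer applies the constraint during decoding, whereas this example only filters a made-up n-best list after the fact. The grammar, the hypotheses, and the scores are all invented for illustration.

```python
from typing import Iterable, Optional, Tuple

# A tiny hypothetical "grammar": <action> the <object>
ACTIONS = {"open", "close"}
OBJECTS = {"door", "window"}


def accepted_by_grammar(transcript: str) -> bool:
    """Return True if the transcript matches '<action> the <object>'."""
    words = transcript.lower().split()
    return (
        len(words) == 3
        and words[0] in ACTIONS
        and words[1] == "the"
        and words[2] in OBJECTS
    )


def best_in_grammar(n_best: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Pick the highest-scoring hypothesis that the grammar accepts."""
    valid = [(text, score) for text, score in n_best if accepted_by_grammar(text)]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]


# Made-up recognizer output: (transcript, score) pairs.
hypotheses = [
    ("open the store", 0.41),    # scores highest, but is not a valid command
    ("open the door", 0.38),
    ("hope in the door", 0.12),
]
print(best_in_grammar(hypotheses))  # prints: open the door
```

With no grammar, the top hypothesis "open the store" would win; with the grammar, only utterances the application can act on are considered, which is why command-style systems tend to recognize better than free dictation.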

Sphinx4 is the only open source ASR that I am aware of. There are a number of commercial products/services, with Nuance being the biggest in the market. Some of the commercial offerings have the option to include humans to transcribe the message when recognition rates are low.

Google has an unofficial API that it uses internally for services like Google Voice, and I believe it is the same one used by the WebKit feature you reference. Google Voice will take voicemail messages, transcribe them, and email the text to you. Google Voice is considered state of the art for transcription, but if you have a Voice account you will see that the transcribed messages are not that great. Here is a link to a blog article on using the unofficial Google Speech API.

It will be a dictation application which transcribes the user's voice to text. I am trying to configure Sphinx4 but so far haven't succeeded. See http://stackoverflow.com/questions/8727389/dictation-application-using-sphinx4 — Amit, Jan 18 '12 at 14:20

score 1 · Answer 2 · edited May 23 '17 at 12:06

In Chrome, that server is a proprietary Google server. You can't set up your own version. People have reverse-engineered the calls to the server (see http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/ for an example), but this is not a good idea for a production or commercial application, since Google may change the API or limit access to it at any time.

Here is an old answer to a different question, but it may be helpful - https://stackoverflow.com/a/6351055/90236

score 1 · Accepted Answer · edited Apr 04 '19 at 14:40

You have several problems to solve:

1. How to capture audio on the client.
2. How to transfer this audio to a server.
3. How to perform the recognition.
4. How to transfer the recognition result and confidence score back.
5. What you are going to do with the recognition result and confidence score (your application).

For the first, you can use Google's approach, where the user clicks a microphone icon and records their voice for some time, or the iPhone Siri approach, where a VAD (voice activity detector) is used to decide which audio to record.
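
As a rough illustration of what a VAD does, here is a minimal energy-based sketch in Python. It is not what Siri uses; production detectors are far more robust. It assumes the audio arrives as frames of 16-bit signed PCM, and the function names, the threshold, and the frame counts are arbitrary values chosen for the example.

```python
import array
import math


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit signed PCM samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def trim_to_speech(frames, threshold=500.0, max_silent_frames=3):
    """Keep frames from the first loud frame until a long enough run of quiet ones."""
    kept = []
    silent_run = 0
    started = False
    for frame in frames:
        loud = frame_rms(frame) >= threshold
        if not started:
            if loud:
                started = True
                kept.append(frame)
            continue  # still waiting for speech to begin
        kept.append(frame)
        silent_run = 0 if loud else silent_run + 1
        if silent_run >= max_silent_frames:
            break  # the speaker has stopped
    # Drop the trailing quiet frames collected at the end.
    return kept[: len(kept) - silent_run]


# Synthetic frames: 160 samples each, either silent or at a constant amplitude.
quiet = array.array("h", [0] * 160).tobytes()
loud = array.array("h", [3000] * 160).tobytes()
stream = [quiet, quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud]
print(len(trim_to_speech(stream)))  # prints: 4
```

The short pause in the middle is kept because it is shorter than `max_silent_frames`; recording stops only after three quiet frames in a row.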

Second, this is basically a TCP/IP file transfer problem. It is also possible to use the Apple / Google approach and compress the audio file using FLAC or Speex.

Third, this is the really hard part. You need much better acoustic models than the ones you can get from Voxforge. This is especially true for continuous, context-free speech recognition like Siri. For commands, Voxforge is fine.

Fourth, it is another file transfer problem.

Fifth, it is your application.
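
Steps 2 to 4 can be sketched with nothing but the Python standard library. This is only a skeleton under stated assumptions: `recognize` is a hypothetical placeholder where a real decoder (Sphinx4, Julius, or a commercial engine) would be called, the `/recognize` path and the JSON field names are made up, and there is no compression, authentication, or scaling.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def recognize(audio: bytes):
    """Hypothetical stand-in for the real decoder (step 3)."""
    # A real server would run its ASR engine here and return its own score.
    return ("<transcript of %d bytes of audio>" % len(audio), 0.0)


class RecognitionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Step 2: receive the uploaded audio from the request body.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        # Step 3: run recognition.
        text, confidence = recognize(audio)
        # Step 4: send the result and confidence score back as JSON.
        body = json.dumps({"text": text, "confidence": confidence}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def transcribe(host: str, port: int, audio: bytes) -> dict:
    """Client side: POST raw audio bytes and decode the JSON reply."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request(
            "POST",
            "/recognize",  # hypothetical path; the handler above ignores it
            body=audio,
            headers={"Content-Type": "application/octet-stream"},
        )
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), RecognitionHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(transcribe("127.0.0.1", server.server_address[1], b"\x00\x01" * 8000))
    server.shutdown()
```

Everything around `recognize` is ordinary plumbing, which matches the point made above: the transfer steps are routine, and the quality of the result depends almost entirely on the decoder and its acoustic models.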

The hard part is the speech recognition part. Another problem, perhaps, is how to scale this to thousands of users. You can use Julius speech recognition as a speech client to capture audio. We can chat more about this problem privately.

The application I will be developing will be installed on a server, and the 'decoded text' will be the output of this module, which will be used by some other module of the application. I have tried configuring Sphinx4 with VoxForge and HUB, but so far nothing is working. See the question http://stackoverflow.com/questions/8727389/dictation-application-using-sphinx4, please help. — Amit, Jan 18 '12 at 14:18
