
This may be a duplicate, but I didn't find answers to my questions below.

I've been researching voice recognition for the past two days, and I haven't found answers to these questions:

  1. Is it possible to run voice recognition as a service? I would like to implement something like this: calling a number by voice even while my phone is in sleep mode.
  2. Does voice recognition detect words properly when I am on a train, bus, etc.?
  3. Is there any sensor that can detect voice, apart from voice recognition?
  4. For voice recognition to work properly, does the user need to speak close to the phone?
gniourf_gniourf
Ramesh Sangili
  • Do you mean *voice* recognition or *speech* recognition? (Read the [tag:voice-recognition] excerpt: "Voice Recognition means identification of the person talking and is frequently misapplied to mean 'Speech Recognition' - identification of what is being said.") – Chris Morgan Dec 24 '12 at 23:59
  • Next time, please search for one question at a time and ask one question at a time. That will help you find the answer. – Nikolay Shmyrev Dec 25 '12 at 06:29

1 Answer


1) It is a proper approach to put voice recognition into a service, as is done in the Google API, where callback methods deliver the results. To keep it running continuously, the service must hold a wake lock so the device does not fall into sleep mode. More information is provided here: Wake locks android service recurring. This has one big disadvantage: high battery usage, caused by the continuous work of the CPU and the continuous processing of incoming sound data. (This can be reduced with filters, thresholds, etc.)
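As a rough sketch of that service idea (the class name and restart policy are illustrative assumptions; it is not runnable as-is, since it needs a manifest entry plus the RECORD_AUDIO and WAKE_LOCK permissions, and most RecognitionListener callbacks are omitted):

```java
// Sketch: a background service that holds a partial wake lock and feeds
// audio to Android's SpeechRecognizer via callbacks.
public class ListenService extends Service implements RecognitionListener {
    private PowerManager.WakeLock wakeLock;
    private SpeechRecognizer recognizer;

    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        PowerManager pm = (PowerManager) getSystemService(POWER_SERVICE);
        // PARTIAL_WAKE_LOCK keeps the CPU running while the screen sleeps --
        // this is the main source of the battery drain mentioned above.
        wakeLock = pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "ListenService");
        wakeLock.acquire();

        recognizer = SpeechRecognizer.createSpeechRecognizer(this);
        recognizer.setRecognitionListener(this);
        recognizer.startListening(new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH));
        return START_STICKY; // ask the system to restart the service if killed
    }

    @Override
    public void onResults(Bundle results) {
        List<String> words =
            results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
        // ... handle the recognized words, then listen again ...
        recognizer.startListening(new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH));
    }

    @Override
    public void onDestroy() {
        recognizer.destroy();
        wakeLock.release();
        super.onDestroy();
    }

    @Override
    public IBinder onBind(Intent intent) { return null; }

    // Remaining RecognitionListener callbacks (onError, onReadyForSpeech, ...)
    // omitted for brevity.
}
```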

2) Voice recognition is not a simple task. It requires a huge number of calculations and a lot of reference data. If the input audio is not clean (noise, several human voices, etc.), it is harder to get proper output. What can be done to improve accuracy is to filter the input audio: noise suppression, a low-pass filter, etc. You cannot expect 100% accuracy, but 80-95% can be achieved.
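To illustrate the low-pass-filter idea in plain Java (the 1 kHz cutoff and 8 kHz sample rate below are arbitrary values chosen for the example, not requirements of any recognizer):

```java
public class LowPass {
    // One-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
    // Frequencies well below cutoffHz pass almost unchanged; frequencies
    // well above it are strongly attenuated.
    public static double[] filter(double[] x, double cutoffHz, double sampleRateHz) {
        double dt = 1.0 / sampleRateHz;
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double alpha = dt / (rc + dt);
        double[] y = new double[x.length];
        double prev = 0;
        for (int n = 0; n < x.length; n++) {
            prev = prev + alpha * (x[n] - prev);
            y[n] = prev;
        }
        return y;
    }

    public static double rms(double[] s) {
        double sum = 0;
        for (double v : s) sum += v * v;
        return Math.sqrt(sum / s.length);
    }

    public static void main(String[] args) {
        int sr = 8000;
        double[] x = new double[sr];
        // a 300 Hz "speech-band" tone plus a 3.8 kHz "hiss"
        for (int n = 0; n < sr; n++) {
            x[n] = Math.sin(2 * Math.PI * 300 * n / sr)
                 + Math.sin(2 * Math.PI * 3800 * n / sr);
        }
        double[] y = filter(x, 1000, sr);
        System.out.printf("in RMS=%.3f out RMS=%.3f%n", rms(x), rms(y));
    }
}
```

After filtering, the high-frequency component is attenuated far more than the speech-band tone, which is exactly the cleanup step described above.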

Filtering out several human voices is harder. But some simple amplitude (audio level) algorithms with an adaptive threshold can be used to decide where a word begins and ends. The idea is that the proper voice is the loudest one, i.e. the one nearest to the phone/device. So, regarding 4), accuracy is better when the user speaks close to the microphone, because that voice is the loudest.
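A toy version of that amplitude-with-adaptive-threshold idea (the frame size, the 10% threshold factor and the synthetic test signal are made-up values for illustration):

```java
public class Endpoint {
    // Returns {firstFrame, lastFrame} of the region whose average energy
    // exceeds an adaptive threshold placed 10% of the way between the
    // quietest and loudest frame, or null if nothing exceeds it.
    public static int[] detect(double[] samples, int frameSize) {
        int frames = samples.length / frameSize;
        double[] energy = new double[frames];
        double min = Double.MAX_VALUE, max = 0;
        for (int f = 0; f < frames; f++) {
            double e = 0;
            for (int i = 0; i < frameSize; i++) {
                double s = samples[f * frameSize + i];
                e += s * s;
            }
            energy[f] = e / frameSize;
            min = Math.min(min, energy[f]);
            max = Math.max(max, energy[f]);
        }
        double threshold = min + 0.1 * (max - min); // adaptive, not fixed
        int start = -1, end = -1;
        for (int f = 0; f < frames; f++) {
            if (energy[f] > threshold) {
                if (start < 0) start = f;
                end = f;
            }
        }
        return start < 0 ? null : new int[]{start, end};
    }

    public static void main(String[] args) {
        int frameSize = 160, frames = 30;
        double[] samples = new double[frameSize * frames];
        for (int n = 0; n < samples.length; n++) {
            int frame = n / frameSize;
            // quiet background everywhere, a loud "word" in frames 10..19
            double amp = (frame >= 10 && frame <= 19) ? 0.5 : 0.01;
            samples[n] = amp * Math.sin(2 * Math.PI * 50 * n / 8000.0);
        }
        int[] span = detect(samples, frameSize);
        System.out.println("word spans frames " + span[0] + ".." + span[1]);
    }
}
```

Because the threshold adapts to the quietest and loudest frames, the same code works whether the background is a quiet room or a noisy bus, as long as the speaker is clearly louder than the background.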

3) I don't know what you mean by sensor, but there are algorithms that simply detect a human voice rather than decode words. These are called Voice Activity Detection (VAD) algorithms. Some code can be found in the Speex project documentation: http://www.speex.org/
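A minimal frame-level VAD sketch (not the Speex algorithm; the fixed threshold and "hangover" count are illustrative assumptions): each frame is classified as speech or silence by its energy, and the flag is held high for a few extra frames so short pauses inside a word are not cut out.

```java
public class Vad {
    // Marks each frame as speech/non-speech by comparing its average energy
    // to a threshold, keeping the flag raised for `hangover` extra frames.
    public static boolean[] detect(double[] samples, int frameSize,
                                   double threshold, int hangover) {
        int frames = samples.length / frameSize;
        boolean[] speech = new boolean[frames];
        int hold = 0;
        for (int f = 0; f < frames; f++) {
            double e = 0;
            for (int i = 0; i < frameSize; i++) {
                double s = samples[f * frameSize + i];
                e += s * s;
            }
            if (e / frameSize > threshold) {
                speech[f] = true;
                hold = hangover;
            } else if (hold > 0) {
                speech[f] = true; // bridge a short pause
                hold--;
            }
        }
        return speech;
    }

    public static void main(String[] args) {
        int fs = 100;
        double[] s = new double[fs * 13];
        // two loud bursts (frames 5-6 and 8-9) separated by one quiet frame
        int[] loudFrames = {5, 6, 8, 9};
        for (int f : loudFrames)
            for (int i = 0; i < fs; i++) s[f * fs + i] = 1.0;
        boolean[] v = detect(s, fs, 0.5, 2);
        StringBuilder out = new StringBuilder();
        for (boolean b : v) out.append(b ? '1' : '0');
        System.out.println(out); // one flag per frame
    }
}
```

Real VADs (like the one in the Speex preprocessor) use adaptive noise estimates instead of a fixed threshold, but the structure is the same.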

The simplest way to handle voice recognition is to use the Google Speech API, which is pretty good and recognizes plenty of languages, but it needs an Internet connection and it takes a while to get a result.
CMU Sphinx is faster, but it has few language models and needs more RAM and more processor computation, since all decoding is done on the device. In my opinion it is very good when the dictionary (the set of words to be recognized) is small, like commands (left, right, backward, stop, start, etc.).
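For the single-keyword case, recent versions of CMU Pocketsphinx ship a keyword-spotting mode. As a config-style sketch (the flag names are from the `pocketsphinx_continuous` tool, and the threshold value is just a starting point to tune, not a recommended setting):

```
# Hypothetical invocation: spot the keyphrase "help" in live microphone input.
# -kws_threshold trades false alarms against misses; tune it for your audio.
pocketsphinx_continuous -inmic yes -keyphrase "help" -kws_threshold 1e-20
```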

MP23
  • Thank you for your inputs. I wanted to develop an application that will automatically call an emergency number when I say "Help", no matter where I am. I could be inside a bus, a market or somewhere else. I didn't get an answer to my 1st question; please let me know if you have any inputs. – Ramesh Sangili Dec 31 '12 at 17:22
  • I updated my answer so it covers your first question as well. In your case Sphinx would be very fast, and very accurate, since there is only one word that needs to be recognized: "HELP". – MP23 Jan 01 '13 at 15:26
  • Again, thanks for your inputs. Regarding the 4th question & answer: as you mentioned, we need to speak close to the microphone for better accuracy. Assume I am on a bus and I need some help where it's too crowded and too noisy; it may not recognize the word correctly, right? – Ramesh Sangili Jan 02 '13 at 19:01
  • "Too crowded" means that there are many other human voices. You can imagine that it is hard to filter this kind of signal. When there are only other noises with higher or lower frequency than human speech (300 Hz - 3 kHz), it is easier to build something like a gate which passes only the specified band (the human one). So to make it easier and reach better accuracy, another method is used: qualifying the proper signal not by frequency but by strength (amplitude). Of course, advanced algorithms use many techniques, both for frequency and amplitude. – MP23 Jan 02 '13 at 22:13
  • So to sum up, it is harder to get a proper result in a noisy place, and accuracy depends on how well the algorithms are implemented, and of course on how good the voice recognition algorithm itself is. But I have spoken about Google Speech and Sphinx, and they handle it really well. – MP23 Jan 02 '13 at 22:17