17

I am looking for an iOS API (ideally free) that does speech recognition. I have seen a few posts on this: "iPhone speech recognition API?" and "free speech recognition engines for iOS?", and after a bit of research I have gathered the SDKs that look quite interesting:

Do any of these really stand out from the crowd, and are they recent? How do they differ from each other?

Community
tiguero

3 Answers

16

If you want to track just a few keywords, you should not look for a speech recognition API or service. This task is called keyword spotting, and it uses different algorithms than speech recognition. Speech recognition tries to find all the words that have been said, and because of that it consumes far more resources than keyword spotting. A keyword spotter only tries to find a few selected keywords or keyphrases. It is much simpler and far less resource-intensive.

The only possible solution to achieve this functionality is to use an open source package like OpenEars, powered by Pocketsphinx:

http://www.politepix.com/openears

OpenEars has the Rejecto plugin, which implements something similar.

Pocketsphinx itself has recently implemented effective open source keyword spotting too, but it hasn't made it into OpenEars yet. It's only available through the Pocketsphinx API: you need to create a kws search and set the target word to look for. I hope this functionality will reach OpenEars soon.
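For anyone going the Pocketsphinx route with several keyphrases: as far as I recall from the CMUSphinx documentation, the keyword search can also be fed a plain-text list with one keyphrase per line followed by a detection threshold between slashes. Below is a minimal sketch that builds such a list. The helper name `format_kws_list` and the threshold values are my own assumptions for illustration; check the file format against your Pocketsphinx version, and expect to tune the thresholds per keyphrase.

```python
def format_kws_list(keyphrases: dict[str, float]) -> str:
    """Build keyword-list text: one '<keyphrase> /<threshold>/' entry per line.

    Hypothetical helper; the line format is assumed from the CMUSphinx docs.
    """
    lines = []
    for phrase, threshold in keyphrases.items():
        phrase = phrase.strip()
        if not phrase:
            raise ValueError("keyphrase must not be empty")
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        # ':g' keeps small thresholds compact, e.g. 1e-40 rather than 0.000...
        lines.append(f"{phrase} /{threshold:g}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative thresholds only; longer phrases usually need smaller values.
    print(format_kws_list({"oh mighty computer": 1e-40, "hello world": 1e-30}), end="")
```

Write the result to a file and point the decoder's keyword-list option at it. A lower threshold makes a keyphrase easier to trigger but increases false alarms.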

Nikolay Shmyrev
  • But `OpenEars`' accuracy is pretty inconsistent and annoying. Can you suggest something better? – Abhishek Bedi Feb 12 '13 at 06:36
  • @AbhishekBedi: OpenEars accuracy is just great for me, probably you are not using it correctly. You need to provide more information in order to get help on that. – Nikolay Shmyrev Feb 12 '13 at 10:23
  • I followed the tutorial provided at [http://www.politepix.com/openears/]. But I don't know how to work with the score. – Abhishek Bedi Feb 12 '13 at 12:47
  • You are welcome to describe your problem in more detail and say exactly what you are doing, what you expect to get and what you actually get. To make analysis easy you need to provide recordings of your voice. This problem is easy to solve as long as you provide enough information. – Nikolay Shmyrev Feb 12 '13 at 14:25
  • From using the demo of openears, I was not impressed initially by the demo. I tried to say, "TESTING" and it replied, you said "TURN". Or saying, "NO" and it says, you said "GO GO". Now, once I realized there were only a fixed set of words I could use, that improved my experience. Also, I think it is almost required to have the Rejecto plugin to reject words not in your fixed word set. – christophercotton Jul 31 '13 at 17:07
  • Yes, Rejecto plugin is highly recommended. – Nikolay Shmyrev Jul 31 '13 at 19:19
  • OpenEars has been very hit-and-miss for me as well. The more words I add, the more it picks wrong words--often words that, to me, sound nothing alike ("list" and "node" for example). When I tried Rejecto, it simply replaced this "wrong word" behavior with doing nothing instead (or optionally reporting null hypotheses)...which I guess is maybe a little better, but not much. I'm not trying to pick on OpenEars in particular...I just think speech recognition as a whole still basically stinks (try using Siri with any sort of background noise). It's a hard problem. – Reid Jun 10 '14 at 17:00
3

Nuance gives developers free access (but not for high volume) - See http://www.masshightech.com/stories/2011/09/26/daily13-Nuance-tweaks-mobile-dev-program-with-free-access-to-Dragon.html or http://dragonmobile.nuancemobiledeveloper.com/public/index.php?task=home

Nuance services are typically offered commercially and require up front fees and transaction fees. The interesting news above is that they now make low volume use of their services available to developers for free. So, for development, testing, and demonstration you can probably use the free Nuance services. However, unlike the Google services that come free in Android, if your app has thousands of users you will likely have to pay for Nuance services.

Michael Levy
  • Thanks Michael - how does it differ from openears or ispeech, which are also free? What do you mean by high volume: the amount of data that needs to be processed to extract keywords? Sorry, I don't know much about speech recognition. In my case I would need to extract a few keywords (max 4/5) continuously: I don't want the user to interact with the app to enter a mode where speech recognition is on. – tiguero Feb 09 '12 at 18:48
  • Nuance is the industry leader in commercial speech recognition. They are like Cisco in networking or EMC in storage. They are a huge successful company with industry leading technology. It is believed that Nuance provides the recognition technology behind Apple's Siri. OpenEars (I believe) is an open source iOS library for Sphinx and other open source recognizers. iSpeech comes from a small team from New Jersey who seem to be famous for the DriveSafe.ly application. Sorry, I don't know too much about them. – Michael Levy Feb 10 '12 at 15:19
2

We have been developing the CeedVocal SDK since 2008; it's based on the Julius and FLite open source projects.

Here's some context: we wanted to build our speech recognition app (Vocalia) back in 2008, so we picked Julius (we hesitated over Pocket Sphinx, which appears to be good as well) and optimized its file format so that it would boot in 1-2 sec instead of 20 sec on the original iPhone. Then we dutifully trained our own acoustic models in 6 languages. We designed the API, and eventually decided to offer it to other developers as an SDK.

CeedVocal basically supports 2 modes of operation:

  1. matching of words (or small phrases)
  2. keyword spotting

In the first mode of operation, it tries to align the input speech to a word (or phrase) in its list of acceptable input. This forces the input to a pre-known word, even if the speech is something else. Accuracy is good. In the second mode of operation, it tries to pick out one of its keywords from the stream of speech. This is a more difficult case, and it can be less accurate.
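To make the difference between the two modes concrete, here is a toy sketch. It is not CeedVocal's API or algorithm: it works on text tokens instead of audio and uses string similarity from Python's `difflib` as a stand-in for an acoustic score. The function names and the 0.8 threshold are assumptions made up for the illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for an acoustic score: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_word(heard: str, vocabulary: list[str]) -> str:
    """Mode 1: always return the closest vocabulary entry.

    There is no rejection, so out-of-vocabulary input is still forced
    onto one of the known words.
    """
    return max(vocabulary, key=lambda word: similarity(heard, word))


def spot_keywords(stream: list[str], keywords: list[str],
                  threshold: float = 0.8) -> list[tuple[int, str]]:
    """Mode 2: scan a stream and report (position, keyword) only where
    the best keyword score reaches the threshold."""
    hits = []
    for position, token in enumerate(stream):
        best = max(keywords, key=lambda kw: similarity(token, kw))
        if similarity(token, best) >= threshold:
            hits.append((position, best))
    return hits


if __name__ == "__main__":
    # Forced matching returns a vocabulary word even for unrelated input.
    print(match_word("banana", ["stop", "go"]))
    # Spotting ignores everything in the stream that is not close to a keyword.
    print(spot_keywords(["please", "stop", "the", "music"], ["stop", "play"]))
```

The threshold is where the trade-off sits: set it too low and the spotter fires on unrelated speech, too high and it misses real keywords, which is one reason the second mode tends to be less accurate.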

rsebbe