11

I need to recognize a letter of the alphabet spoken by a user into a device's microphone. The device could be an Android-powered mobile phone.

For example, when the user says 'R', it should give me 'R' and not 'Are'.

How can I accomplish this spoken-letter recognition in Java? I am looking for ideas that can be easily expressed in code.

Edit

Based on a suggestion by @David Hilditch, I came up with the following map of letters and the words they sound like.

A - ye, a, yay
B - be, bee
C - see, sea
D - thee, dee, de
E - eh, ee
F - eff, F
G - jee
H - edge, hedge, hatch, itch
I - Aye, eye, I
J - je, jay, joy
K - kay, ke
L - el, yell, hell
M - am, yam, em
N - yen, en
O - oh, vow, waw
P - pee, pay, pie
Q - queue
R - are, err, year
S - yes, ass, S
T - tee, tea
U - you, U
V - we, wee
W - double you
X - axe
Y - why
Z - zed, zee, jed
halfer
Ron

6 Answers

6

You could get the text from voice using Google's API (have a quick look at http://developer.android.com/reference/android/speech/RecognizerIntent.html ).

Then, if you want to infer the language (and from it, the alphabet), you could use an open-source project called "Language detector", which is based on n-grams:

http://code.google.com/p/language-detection/

You could combine it with dictionary matches ("dictionary coincidences") and other features that you can get from the text.

arutaku
  • I have seen the first link before. I don't want to launch another activity for taking voice input. I'll check the second link. – Ron Sep 16 '12 at 18:07
  • The second link shows how to use the text (once you have it) to infer language -> alphabet, as you mentioned in your question the first time I read it. – arutaku Sep 16 '12 at 18:10
  • I'm afraid you have to launch another activity unless you code the whole voice recognizer yourself. I always use Google's one and it works really well. – arutaku Sep 16 '12 at 18:11
  • I need to recognize only letters, not whole words. E.g. when the user says 'R' I want it to give me 'R' and not 'Are'. Will `RecognizerIntent` work this way? – Ron Sep 16 '12 at 18:32
4

I think a good option is to follow the guidelines @rmunoz posted. But if you do not want to use an external activity, then I am afraid you will have to code the speech recognition yourself. I am also not sure how good Android's speech recognition is for single letters. I suppose the underlying models were trained on words.

I think this would be best accomplished with neural networks. First, you will have to collect a lot of samples of different people saying letters (say, 2 examples of each letter per person), and label each sample with the letter that was spoken. That gives you 52 examples per person, so with 10 people participating you have 520 examples of spoken letters. Next, you train your neural network on these labelled examples. A very good tutorial is here: https://www.coursera.org/course/ml. Then you just have to store the trained network (its parameters) and use it for classification: the person speaks into the microphone, and the neural network assigns a letter to the newly acquired example.
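To make the train-then-classify idea concrete, here is a minimal sketch in Java. It is not a full neural network: it is a single softmax layer with no hidden layer, trained with plain stochastic gradient descent, which is roughly the smallest thing that fits the description. The class name `SoftmaxClassifier` and the idea that each spoken sample has already been turned into a fixed-length `double[]` feature vector are my assumptions, not something from the course linked above.

```java
/** Single-layer softmax classifier: one weight vector and one bias per class. */
class SoftmaxClassifier {
    private final double[][] w; // w[class][feature]
    private final double[] b;   // bias per class

    SoftmaxClassifier(int classes, int features) {
        w = new double[classes][features];
        b = new double[classes];
    }

    /** Returns the class probabilities for one feature vector. */
    double[] predictProba(double[] x) {
        double[] z = new double[b.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < z.length; k++) {
            z[k] = b[k];
            for (int j = 0; j < x.length; j++) z[k] += w[k][j] * x[j];
            max = Math.max(max, z[k]);
        }
        // Subtract the maximum before exponentiating for numerical stability.
        double sum = 0;
        for (int k = 0; k < z.length; k++) { z[k] = Math.exp(z[k] - max); sum += z[k]; }
        for (int k = 0; k < z.length; k++) z[k] /= sum;
        return z;
    }

    /** Returns the index of the most probable class (e.g. 0 = 'A', 1 = 'B', ...). */
    int predict(double[] x) {
        double[] p = predictProba(x);
        int best = 0;
        for (int k = 1; k < p.length; k++) if (p[k] > p[best]) best = k;
        return best;
    }

    /** Stochastic gradient descent on the cross-entropy loss. */
    void train(double[][] xs, int[] labels, int epochs, double learningRate) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < xs.length; i++) {
                double[] p = predictProba(xs[i]);
                for (int k = 0; k < b.length; k++) {
                    double grad = p[k] - (labels[i] == k ? 1.0 : 0.0);
                    b[k] -= learningRate * grad;
                    for (int j = 0; j < xs[i].length; j++) {
                        w[k][j] -= learningRate * grad * xs[i][j];
                    }
                }
            }
        }
    }
}
```

For 26 letters you would create it with `new SoftmaxClassifier(26, featureCount)`, call `train` on the labelled samples, and map the index returned by `predict` back to a letter. With only 520 examples and a model this simple, I would not expect great accuracy; it only shows the shape of the approach.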

There is only one problem: how to represent the recorded sound so that the neural network can be trained on it and later classify it. You have to compute some spectral features of the input sound. You can read about that at http://www.cslu.ogi.edu/tutordemos/nnet_recog/recog.html. But I strongly advise you to go through the first link before diving into this one (if you do not know anything about neural networks yet).
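As a rough illustration of what "spectral features" can mean, the sketch below computes the magnitude spectrum of one short frame of audio samples with a naive discrete Fourier transform and then sums it into a few frequency bands. The class and method names are made up for this example, and the equal-width bands are a simplification I chose; real recognizers typically use richer features such as mel-frequency cepstral coefficients.

```java
/** Very small spectral feature extractor for one frame of audio samples. */
class SpectralFeatures {
    /** Magnitudes of DFT bins 0..n/2 for one frame (naive O(n^2) transform). */
    static double[] magnitudeSpectrum(double[] frame) {
        int n = frame.length;
        double[] mags = new double[n / 2 + 1];
        for (int k = 0; k < mags.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            mags[k] = Math.sqrt(re * re + im * im);
        }
        return mags;
    }

    /** Sums squared magnitudes into equal-width bands to get a short feature vector. */
    static double[] bandEnergies(double[] mags, int bands) {
        double[] out = new double[bands];
        for (int k = 0; k < mags.length; k++) {
            int band = Math.min(bands - 1, k * bands / mags.length);
            out[band] += mags[k] * mags[k];
        }
        return out;
    }
}
```

You would slice the recording into frames, call `magnitudeSpectrum` and `bandEnergies` on each, and concatenate the results into the input vector for the classifier. For anything beyond a toy, replace the naive transform with an FFT.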

The other answers assume that you can already recognize words such as "Are". From my understanding of the question, this is not the case, so the mapping posted in the question will not help you.

Nejc
3

If you already have your Java program successfully recognising the word 'Are' when someone says 'R' then why not just enumerate the 26 letter words and translate them?

e.g.

Ay, Aye, Ai -> A
Bee, Be -> B
Sea, See -> C
Dee, Deer, Dear -> D

Is that too simplistic? Seems like it would work to me and you can use any speech recognition software you like.

You have the advantage of having a very restricted sphere of context here (letters of the alphabet) so it's going to take you less than an hour to configure this.

You can keep a record of any words which don't successfully translate and manually listen to them to improve your enumeration.

Having said that, I'm sure most decent speech recognition software would have an option to restrict the system to recognising letters and numbers rather than words, but if not, try my solution - it'll work.

To build your enumeration, simply talk to your system and get it to translate as you recite the alphabet.
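The enumeration described above could be sketched in Java roughly like this. `LetterTranslator` and its methods are names I made up for illustration; the only assumption about the recognizer is that it hands you the recognized word as a `String`. Unknown words are recorded so you can review them later and extend the table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Maps words returned by a speech recognizer back to the letter the user probably said. */
class LetterTranslator {
    private final Map<String, Character> wordToLetter = new HashMap<>();
    // Words that matched nothing, kept so the table can be improved by hand.
    private final List<String> unmatched = new ArrayList<>();

    /** Registers the sound-alike words for one letter, e.g. add('R', "are", "err"). */
    void add(char letter, String... words) {
        char upper = Character.toUpperCase(letter);
        for (String w : words) {
            // If the same word is registered for two letters, the later call wins.
            wordToLetter.put(normalize(w), upper);
        }
        // The bare letter always maps to itself.
        wordToLetter.put(normalize(String.valueOf(letter)), upper);
    }

    /** Returns the letter for a recognized word, or null (and records the word) if unknown. */
    Character translate(String recognized) {
        Character letter = wordToLetter.get(normalize(recognized));
        if (letter == null) {
            unmatched.add(recognized);
        }
        return letter;
    }

    List<String> unmatchedWords() {
        return unmatched;
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LetterTranslator t = new LetterTranslator();
        t.add('R', "are", "err");
        t.add('C', "see", "sea");
        System.out.println(t.translate("Are"));
        System.out.println(t.translate("hello"));
        System.out.println(t.unmatchedWords());
    }
}
```

A plain `HashMap` is enough here because the vocabulary is tiny; the interesting work is filling the table by reciting the alphabet to your recognizer and seeing what comes back.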

Dave Hilditch
2

I'm coming from a speech recognition background on IVRs, but you could use a custom language grammar to define which utterances are valid.

I believe you can use something like http://cmusphinx.sourceforge.net/wiki/ or http://jvoicexml.sourceforge.net/ to perform the actual recognition.

The grammar you would load could look like:

#JSGF V1.0;

grammar alphabet;

public <alphabet> = a | b | c | d | e;  // etc.

It's a bit redundant to recognize letters in a grammar when they are already part of the language, but it's an easy way to restrict the recognizer to returning only the utterances you want to deal with.
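Rather than typing out all 26 alternatives, you could generate the grammar text. This is only a sketch: `AlphabetGrammar` is a made-up helper, and whether your recognizer's pronunciation dictionary actually contains entries for the bare letters is something you would need to check for the engine you pick.

```java
/** Builds the text of a JSGF grammar whose only public rule accepts a single letter. */
class AlphabetGrammar {
    static String build() {
        StringBuilder sb = new StringBuilder();
        sb.append("#JSGF V1.0;\n\n");
        sb.append("grammar alphabet;\n\n");
        sb.append("public <alphabet> = ");
        for (char c = 'a'; c <= 'z'; c++) {
            if (c > 'a') {
                sb.append(" | "); // separator between alternatives
            }
            sb.append(c);
        }
        sb.append(";\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Write this output to a .gram file, or pass it to the recognizer directly.
        System.out.print(build());
    }
}
```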

NathanS
2

David is right. Since your output set is limited, you have the option of hand-coding rules like Are->R.

The problem is with letters that sound similar. For example, the person may have said N, but your system recognizes it as M. You can take a look at language modeling to predict likely character sequences. For example, if your user said 'I' before and 'G' after, a bidirectional language model will give higher probability to 'N' than 'M'.
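Here is a small sketch of that idea using letter-pair (bigram) counts. The class `LetterBigramModel` is hypothetical, and the score is a crude product of add-one-smoothed counts rather than properly normalized probabilities, but it is enough to show how the neighbours on both sides can break a tie between similar-sounding letters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Scores candidate letters by how often they occur next to their neighbours. */
class LetterBigramModel {
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    /** Counts every pair of adjacent letters in a training word. */
    void train(String word) {
        String w = word.toUpperCase(Locale.ROOT);
        for (int i = 0; i + 1 < w.length(); i++) {
            bigramCounts.merge(w.substring(i, i + 2), 1, Integer::sum);
        }
    }

    private int count(char a, char b) {
        return bigramCounts.getOrDefault("" + a + b, 0);
    }

    /** Picks the candidate that fits best between the previous and the next letter. */
    char pick(char prev, char next, char... candidates) {
        char best = candidates[0];
        long bestScore = -1;
        for (char c : candidates) {
            // Add one to each count so unseen pairs do not zero out the score.
            long score = (long) (count(prev, c) + 1) * (count(c, next) + 1);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

Trained on a few words such as "SING", "RING" and "KING", `pick('I', 'G', 'M', 'N')` prefers 'N', because "IN" and "NG" are seen far more often than "IM" and "MG". In practice you would train on a large word list.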

Dictionary-based approaches work fine too. If one interpretation of an ambiguous letter yields a word in the dictionary and the other does not (e.g. "NOSE" vs "MOSE"), choose the one that is valid.
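The dictionary check can be sketched as follows. It assumes that, for each position, you keep the set of letters the recognizer could not tell apart (written here as a string such as "MN"); that representation and the class name `DictionaryDisambiguator` are my own choices for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Expands ambiguous letter sequences and keeps only the spellings found in a dictionary. */
class DictionaryDisambiguator {
    private final Set<String> dictionary;

    DictionaryDisambiguator(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** candidates.get(i) holds the possible letters heard at position i, e.g. "MN". */
    List<String> validWords(List<String> candidates) {
        List<String> out = new ArrayList<>();
        expand(candidates, 0, new StringBuilder(), out);
        return out;
    }

    // Depth-first enumeration of every combination of candidate letters.
    private void expand(List<String> candidates, int pos, StringBuilder prefix, List<String> out) {
        if (pos == candidates.size()) {
            String word = prefix.toString();
            if (dictionary.contains(word)) {
                out.add(word);
            }
            return;
        }
        for (char c : candidates.get(pos).toCharArray()) {
            prefix.append(c);
            expand(candidates, pos + 1, prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }
}
```

With a dictionary containing "NOSE" and the candidates "MN", "O", "S", "E", only "NOSE" survives. The number of combinations grows quickly with the number of ambiguous positions, so this brute-force expansion only suits short words.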

Sau
2

Any speech-to-text platform should work as needed. This post discusses some of the available options, which include the built-in speech-to-text, an open-source option called CMUSphinx, and a free, closed-source option from Microsoft.

Community
Phil