iOS / C: Algorithm to detect phonemes

Question

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.

Preferably the lowest level that does the job.

I'm developing radical / experimental musical training software for iPhone / iPad.

My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.

I have done a lot of research into this, it looks like my best bet may be to use one of the iOS Sphinx wrappers ( iPhone App › Add voice recognition? is the best source of information I have found ). However, I can't see how I would adapt such a package, can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?

Would training be necessary by the user? I would have thought not, as it is such an elementary task, compared with full language models of thousands of words and far greater and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.

Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.

So really I'm hunting for any open source software that recognises phonemes.

PS I need a solution which runs pretty much real-time. so even as they are singing the note, firstly it blinks on to illustrate that it picked up the phoneme pair that was sung, and then it glows to illustrate whether they are singing the correct note pitch

Why the downvote and vote to close? This is a valid coding question, and I have supplied a realworld usage context. Did somebody have a bad day at the office?? — P i, Jun 17 '11 at 16:36
I didn't downvote, but your question is broad and although specific in topic, it's vague in nature. Your question is not so much code related as it conceptual. That's not bad, per se, but large vague questions may be discouraged. — Moshe, Jun 17 '11 at 16:43
@Moshe, No. There is nothing vague in my question. I seek code for realtime phoneme recognition. Additionally I specify a limitation on the phoneme base. — P i, Jun 19 '11 at 04:19

score 5 · Answer 1 · answered Jun 30 '11 at 08:59

If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:

Since you want to do a phone level recogniser, the data needed to train the phone models would be very high. Also, your training database should be balanced in terms of distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of that too.
Since this is open-source, you should also check into the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and then have the recorded waveform sent over a data channel to a server for the recognition, pretty much something like what google does.

i am glad that helped you out. – Sriram Jul 05 '11 at 09:55 — Sriram, Jul 05 '11 at 09:55

score 4 · Answer 2 · answered Jun 22 '11 at 21:53

I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.

One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.

Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)

...using a sledgehammer to crack a nut.

I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.

All the best with your problem solving.

score 1 · Answer 3 · answered Jun 30 '11 at 14:12

1

Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.

answered Jun 30 '11 at 14:12

Jacob M. Barnard

1,347
1
10
24

score 0 · Answer 4 · answered Jun 23 '11 at 03:41

0

http://www.hfink.eu/matchbox

This page links to both YouTube video demo and github source.

I'm guessing it would still be a lot of work to mould it into the shape I'm after, but is also definitely does do a lot of the work.

answered Jun 23 '11 at 03:41

P i

29,020
36
159
267

iOS / C: Algorithm to detect phonemes

4 Answers4

Linked