
I need to measure the "quality" of a user's pronunciation with the help of the Microsoft Speech SDK (System.Speech.Recognition). I am using the US English speech engine, so what I actually need is to find out how close the speaker's pronunciation is to a "North American" accent.

One way of doing this is by checking how close the user's voice is to the US English phonetic pronunciation. As mentioned in MSDN, this matching appears to be done inside the speech SDK itself, so I need to get that information out. Since we can also set the phonetics for the engine ourselves, I am sure this is possible.
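For example, as I understand it, a pronunciation can be supplied through an SRGS token, something like the following sketch (the phone string "h eh l ow" is only a placeholder I have not verified):

using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

//Sketch: attach our own pronunciation to a token.
//"h eh l ow" is a placeholder phone string, not a verified transcription.
SrgsToken token = new SrgsToken("hello");
token.Pronunciation = "h eh l ow";

SrgsRule rule = new SrgsRule("main", token);
SrgsDocument doc = new SrgsDocument(rule);
doc.PhoneticAlphabet = SrgsPhoneticAlphabet.Sapi; //the phones above use the SAPI set

Grammar g = new Grammar(doc);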

However, I have no clear idea of what to do. So, how can I find out the quality of the user's pronunciation, i.e. how close it is to US (North American) English phonetic pronunciation? The user will only speak pre-defined sentences such as "Hello World. I am here".

Update

I got some kind of "phonemes" (as mentioned in MSDN) by using the following code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Windows.Forms;
using System.IO;

namespace US_Speech_Recognizer
{
    public class RecognizeSpeech
    {
        private SpeechRecognitionEngine sEngine; //Speech recognition engine
        private SpeechSynthesizer sSpeak; //Speech synthesizer

        public RecognizeSpeech()
        {
            //Make the recognizer ready
            sEngine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

            //Load a grammar containing the single pre-defined sentence
            Choices sentences = new Choices();
            sentences.Add(new string[] { "I am hungry" });

            GrammarBuilder gBuilder = new GrammarBuilder(sentences);
            Grammar g = new Grammar(gBuilder);
            sEngine.LoadGrammar(g);

            //Add a handler
            sEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sEngine_SpeechRecognized);

            sSpeak = new SpeechSynthesizer();
            sSpeak.Rate = -2;

            //Computer speaks the words to get the phones
            Stream stream = new MemoryStream();
            sSpeak.SetOutputToWaveStream(stream);
            sSpeak.Speak("I am hungry"); //must match the sentence in the grammar
            stream.Position = 0; //rewind so the recognizer reads from the start
            sSpeak.SetOutputToNull();

            //Configure the recognizer to read from the stream
            sEngine.SetInputToWaveStream(stream);
            sEngine.RecognizeAsync(RecognizeMode.Single);
        }

        //Handle a successful recognition and collect the phonemes
        private void sEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string text = "";

            if (e.Result.Text == "I am hungry")
            {
                foreach (RecognizedWordUnit wordUnit in e.Result.Words)
                {
                    text = text + wordUnit.Pronunciation + "\n";
                }

                MessageBox.Show(e.Result.Text + "\n" + text);
            }
        }
    }
}

The code directly related to the phonemes is the sEngine_SpeechRecognized handler shown above.
Following is my output. The first line shows the recognized sentence, and the phonemes I got are displayed starting from the second line:

(Screenshot: the recognized sentence on the first line, followed by one pronunciation string per word.)

So, please tell me: according to MSDN these are "phonemes". Is this actually what phonemes look like? I have never seen them before, which is why I ask.

The above code was written according to this link: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.srgsgrammar.srgstoken.pronunciation(v=office.14).aspx


1 Answer


Ok, here's how I'd approach the problem.

First, load up the dictation engine with the Pronunciation topic, which will return the phonemes spoken by the user (in the Recognition event).
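In C# with System.Speech, that step might look like this minimal sketch (assuming the installed en-US recognizer supports the pronunciation dictation topic; see the comments below):

using System;
using System.Speech.Recognition;

class PronunciationDemo
{
    static void Main()
    {
        var engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

        //The pronunciation topic asks the engine to return phonemes
        //rather than normalized text.
        engine.LoadGrammar(new DictationGrammar("grammar:dictation#Pronunciation"));
        engine.SetInputToDefaultAudioDevice();

        engine.SpeechRecognized += (s, e) =>
        {
            foreach (RecognizedWordUnit word in e.Result.Words)
                Console.WriteLine(word.Pronunciation); //phonemes spoken by the user
        };

        engine.RecognizeAsync(RecognizeMode.Multiple);
        Console.ReadLine(); //keep the process alive while recognizing
    }
}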

Second, get the reference phonemes for the word using the ISpEnginePronunciation::GetPronunciations method (as I outlined here).

Once you have the two sets of phonemes, you can compare them. Essentially, the phonemes are separated by spaces, and each phoneme is represented by a short tag (described in the American English Phoneme Representation spec); for example, "hello" comes out roughly as "h eh l ow".

Given this, you should be able to compute a score by comparing the phonemes by any number of approximate string matching schemes (e.g., Levenshtein distance).
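For example, a plain Levenshtein distance over the two phoneme sequences (a sketch; it assumes both pronunciations are space-separated phone strings as described above):

using System;

static class PhonemeScore
{
    //Classic Levenshtein distance, counted in whole phonemes.
    static int Levenshtein(string[] a, string[] b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1; //substitution
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,       //deletion
                    d[i, j - 1] + 1),      //insertion
                    d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    //Normalize to 0..1, where 1 means the pronunciations are identical.
    public static double Score(string reference, string spoken)
    {
        string[] r = reference.Split(' ');
        string[] s = spoken.Split(' ');
        return 1.0 - (double)Levenshtein(r, s) / Math.Max(r.Length, s.Length);
    }
}

For instance, PhonemeScore.Score("h eh l ow", "h ah l ow") returns 0.75 (one substituted phoneme out of four).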

You might find the problem simpler by comparing phone IDs rather than strings; ISpPhoneConverter::PhoneToId can convert the phoneme strings to an array of phone IDs, one ID per phoneme. That would give you a pair of null-terminated integer arrays, perhaps better suited to your comparison algorithm.

You could use the engine confidence to penalize matches, as low engine confidence indicates that the incoming audio doesn't closely match the engine's idea of the phoneme.
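For instance (a sketch; the simple multiplicative weighting is an assumption, tune it to your data):

//Penalize the phoneme-match score by the engine's own confidence,
//so audio the engine was unsure about scores lower.
//RecognitionResult.Confidence is already normalized to 0..1.
static double ConfidenceWeightedScore(double phonemeScore, float engineConfidence)
{
    return phonemeScore * engineConfidence;
}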

  • Hello, I am working in C#. I did some work to get the phonemes and it generated some characters. I am posting the code and a screenshot of my output; please be kind enough to tell me whether these are the "phonemes". Please see my question again, I have edited it to include this data – PeakGen May 29 '13 at 15:12
  • Yes, those phonemes look roughly correct, although it looks like it's using the IPA (International Phonetic Alphabet) phoneme set rather than the SAPI phoneme set. – Eric Brown May 29 '13 at 19:16
  • Since you're using C# and System.Speech.Recognition, you should also use DictationGrammar("grammar:dictation#Pronunciation"); if that does not work, you'll have to fall back to C++. (You might still need to use C++ for some parts, as ISpEnginePronunciation isn't exposed through System.Speech.Recognition.) – Eric Brown May 29 '13 at 19:29
  • Thanks a lot for the reply. ISpEnginePronunciation lets the computer speak and gets the phonemes of the computer's speech, right? I did that by loading the computer's speech into a stream and feeding it to the recognizer; that code is also above, under the comment "//Computer speaks the words to get the phones". :) What do you think? – PeakGen May 29 '13 at 21:44
  • Another question: when the custom grammar is given, it always gives the same output even when you pronounce it differently, but that does not happen with the dictation grammar. Why is this? – PeakGen May 29 '13 at 22:07
  • No. ISpEnginePronunciation is a way for applications to query the SR engine about how the engine believes that words should be pronounced. It does *not* use the TTS engine to do this. Also, when you're using a custom grammar, the SR engine will try very hard to match the audio input to the grammar. The dictation grammar can match almost anything, so the SR engine doesn't need to try that hard. (On the other hand, the dictation grammar can provide unexpected results, particularly when the engine hasn't been trained.) – Eric Brown May 30 '13 at 01:26
  • Another question (sorry, I know there are too many): my lecturer said it will be enough if I can find how close the computer's pronunciation is to the user's pronunciation. In that case, my system will work fine, right? I mean matching the PC pronunciation phonemes with the user's pronunciation phonemes? – PeakGen May 30 '13 at 05:44
  • Yes, although you'll need to find a way to compare two phonemes for similarity (e.g., when the user says the wrong phoneme, instead of adding/deleting phonemes). The SAPI Phonetic Alphabet Reference can help you here, as it breaks down the consonants & vowels into features. – Eric Brown May 31 '13 at 02:09