
I'm working on a personal project involving microphones in my apartment that I can issue verbal commands to. To accomplish this, I've been using the Microsoft Speech API, specifically SpeechRecognitionEngine from System.Speech.Recognition in C#. I construct a grammar as follows:

// validCommands is a Choices object containing all valid command strings
// recognizer is a SpeechRecognitionEngine
GrammarBuilder builder = new GrammarBuilder(recognitionSystemName);
builder.Append(validCommands);
recognizer.SetInputToDefaultAudioDevice();
recognizer.LoadGrammar(new Grammar(builder));
recognizer.RecognizeAsync(RecognizeMode.Multiple);

// etc ...

This seems to work pretty well when I actually give it a command; it hasn't misidentified one of my commands yet. Unfortunately, it also tends to pick up random talking as commands! I've tried to ameliorate this by prefacing the command Choices object with a "name" (recognitionSystemName) that I address the system by. Oddly, this doesn't seem to help. Since I'm restricting it to a set of predetermined command phrases, I would have thought that it would be able to detect when speech isn't any of the strings. My best guess is that it's assuming that all sound is a command and picking the best match from the command set. Any advice on improving this system so that it no longer triggers on conversation not directed at it would be very helpful.

Edit: I've moved the name recognizer to a separate SpeechRecognitionEngine, but the accuracy is awful. Here's a bit of test code I wrote to examine the accuracy:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Speech.Recognition;

namespace RecognitionAccuracyTest
{
    class RecognitionAccuracyTest
    {
        static int recogcount;
        [STAThread]
        static void Main()
        {
            recogcount = 0;
            System.Console.WriteLine("Beginning speech recognition accuracy test.");

            SpeechRecognitionEngine recognizer;
            recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder("Octavian")));
            recognizer.SpeechHypothesized += new EventHandler<SpeechHypothesizedEventArgs>(recognizer_SpeechHypothesized);
            recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);

            Console.ReadLine(); // block the main thread while recognition runs asynchronously
        }

        static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            System.Console.WriteLine("Recognized @ " + e.Result.Confidence);
            try
            {
                if (e.Result.Audio != null)
                {
                    using (var stream = new System.IO.FileStream("audio" + ++recogcount + ".wav", System.IO.FileMode.Create))
                    {
                        e.Result.Audio.WriteToWaveStream(stream);
                    }
                }
            }
            catch (Exception) { } // ignore audio write failures in this test harness
        }

        static void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            System.Console.WriteLine("Hypothesized @ " + e.Result.Confidence);
        }
    }
}

If the name is "Octavian", it recognizes stuff like "Octopus", "Octagon", "Volkswagen", and "Wow, really?". I can clearly hear the difference in the associated audio clips. Any ideas on making this not awful would be great.

Octavianus
  • So how did you end up solving your problem? I am not sure the marked answer really solves it. Can you share anything on what made it better? I seem to be in a similar situation where the recognizer is just recognizing too much AND with a high confidence rate. – darbid Sep 16 '13 at 19:32
  • The marked answer is a marginal improvement. What I did to make it much better was to switch to SRGS grammars, and have a dictation element as the first item. Then, when I get a result, I compare the first word with my system's name. If it doesn't match, I discard the result. Sometimes I have to repeat myself, but I've virtually eliminated false positives by doing this. – Octavianus Sep 17 '13 at 01:46
  • Thank you for taking the time. By a dictation element, do you mean something like the MS examples here? http://msdn.microsoft.com/en-us/library/ms723634%28v=vs.85%29.aspx – darbid Sep 17 '13 at 12:27
  • I am pretty sure that my question here is also another alternative approach to solving this issue. http://stackoverflow.com/questions/18821566/accuracy-of-ms-system-speech-recognizer-and-the-speechrecognitionengine?noredirect=1#comment27818909_18821566 – darbid Sep 17 '13 at 16:29
  • I used the SrgsRuleRef class and SrgsDocument objects to generate my grammars. SrgsRuleRef.Dictation represents the dictation element - unfortunately, documentation on this is nonexistent, so I haven't figured out how to limit it to one word of dictation, but it seems to work fairly well regardless. (A sketch of this approach follows these comments.) – Octavianus Sep 18 '13 at 05:03
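For reference, a minimal sketch of the approach described in these comments, assuming the recognizer from the question; the rule name and command phrases are illustrative assumptions:

using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

// A grammar whose first element is a dictation rule reference, followed by
// the command choices; background speech tends to bind to the dictation part.
SrgsRule root = new SrgsRule("command");
root.Add(SrgsRuleRef.Dictation); // matches arbitrary speech, e.g. the spoken system name
root.Add(new SrgsOneOf("turn on the lights", "turn off the lights")); // example commands

SrgsDocument doc = new SrgsDocument();
doc.Rules.Add(root);
doc.Root = root;
recognizer.LoadGrammar(new Grammar(doc));

// In the SpeechRecognized handler, discard results whose first word isn't the name:
// string first = e.Result.Text.Split(' ')[0];
// if (!first.Equals("Octavian", StringComparison.OrdinalIgnoreCase)) return;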

4 Answers


Let me make sure I understand: you want a phrase that sets apart commands to the system, like "Butler" or "Siri". So, you'll say "Butler, turn on the TV". You can build this into your grammar.

Here is an example of a simple grammar that requires an opening phrase before it recognizes a command. It uses semantic results to help you understand what was said. In this case the user must say "Open", "Please open", or "Can you open".

    private Grammar CreateTestGrammar()
    {
        // item
        Choices item = new Choices();
        SemanticResultValue itemSRV;
        itemSRV = new SemanticResultValue("I E", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("explorer", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("firefox", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("mozilla", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("chrome", "chrome");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("google chrome", "chrome");
        item.Add(itemSRV);
        SemanticResultKey itemSemKey = new SemanticResultKey("item", item);

        // wrap the semantic key in a GrammarBuilder
        GrammarBuilder gb = new GrammarBuilder();
        gb.Append(itemSemKey);

        // now build the complete pattern...
        GrammarBuilder itemRequest = new GrammarBuilder();
        // required opening phrase before the command
        itemRequest.Append(new Choices("Can you open", "Open", "Please open"));

        itemRequest.Append(gb);

        Grammar TestGrammar = new Grammar(itemRequest);
        return TestGrammar;
    }

You can then process the speech with something like:

RecognitionResult result = myRecognizer.Recognize();

and check for semantic results like:

if(result.Semantics.ContainsKey("item"))
{
   string s = (string)result.Semantics["item"].Value;
}
Michael Levy
  • This is more-or-less what I have currently, with the exception that I've just been using a Hashtable instead of a semantic result key-value pair. So if I'm reading your answer correctly, the key mistake that I've made is constructing my phrase in the same GrammarBuilder as my command choices. Is this correct? – Octavianus Nov 01 '11 at 16:09
  • You don't have to use the SemanticResultValue. I just had the code sample available to copy-paste. But, I think the answer is you want to add two things to your grammar: the "wake up" expression, then add the list of choices for your available commands. Otherwise, what is in your dictionary? If the wakeup expression is just one child in your dictionary, I don't think you'll get the results you want. – Michael Levy Nov 01 '11 at 16:22
  • http://msdn.microsoft.com/en-us/library/hh361662.aspx has a good example. The phrase "I want to fly from" is required by the grammar. Then a list of valid choices is added after the required phrase. – Michael Levy Nov 01 '11 at 16:27
  • From the code I posted, my grammar is built from a single GrammarBuilder. That GrammarBuilder is initialized with the name of the system ('Butler' for example), and then a Choices object with the valid commands is appended. I think that this corresponds to what you said in your first comment about what's in my grammar. I was going off this: http://msdn.microsoft.com/en-us/library/ms554229.aspx which has more or less what I'm doing, but the issue is that it still picks up a lot of normal conversation that's not directed at it as commands. – Octavianus Nov 01 '11 at 16:32
  • OK, I see your example more clearly now. I misunderstood what recognitionSystemName was. So, yes, it appears you're doing what I've suggested. – Michael Levy Nov 01 '11 at 16:35
  • I've gone ahead and split the recognizer for the name to a separate SpeechRecognitionEngine, but the accuracy is horrible. If the name is, say, "Octavian", it matches "Octavian", "Octagon", "Octahedron", "Octopus", "Volkswagen", "Wow, really?" and a few other things some of which weren't even words. – Octavianus Nov 02 '11 at 13:04

I'm having the same problem. I'm using the Microsoft Speech Platform, so it could be a little different in accuracy etc.

I'm using "Claire" as a wake-up command, but it's true that it recognizes different words as "Claire" too. The problem is that the engine hears you speak and searches for the closest match.

I haven't found a really good solution to this. You could try filtering the recognized speech with the Confidence field (sketched below), but it's not very reliable with my chosen recognizer engine. I just throw every word that I want to recognize into one big SRGS.xml and set the repeat value to 0-. I only accept the recognized sentence if "Claire" is the first word. This solution is not what I want, as it doesn't work as well as I'd wish, but it's still a little improvement.
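A minimal sketch of that confidence filter inside a SpeechRecognized handler; the 0.90 threshold and the "Claire" check are assumptions to tune per engine and microphone:

    if (e.Result.Confidence < 0.90f)
        return; // treat low-confidence matches as background conversation

    string[] words = e.Result.Text.Split(' ');
    if (!words[0].Equals("Claire", StringComparison.OrdinalIgnoreCase))
        return; // only accept sentences that start with the wake word

    // ... handle the command ...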

I'm currently busy with it, and I will post more info as I progress.

EDIT 1: As a comment on what Dims says: it's possible in an SRGS grammar to add a "GARBAGE" rule. You might want to look into that: http://www.w3.org/TR/speech-grammar/
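A hedged sketch of that GARBAGE rule via SrgsRuleRef.Garbage (System.Speech shown; the Microsoft Speech Platform has an equivalent Microsoft.Speech type, and the phrases here are illustrative):

    SrgsRule root = new SrgsRule("command");
    root.Add(SrgsRuleRef.Garbage);            // absorb leading speech
    root.Add(new SrgsItem("Claire"));         // required wake word
    root.Add(new SrgsOneOf("turn on the TV", "turn off the TV")); // example commands
    root.Add(SrgsRuleRef.Garbage);            // absorb trailing speech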

Eric Smekens
  • I eagerly look forward to hearing your progress, but Garbage seems to be aimed at absorbing individual words or phrases in the middle of sentences. From the spec: `Defines a rule that may match any speech up until the next rule match, the next token or until the end of spoken input. A grammar processor must accept grammars that contain special references to GARBAGE. The behavior of the GARBAGE rule is implementation-specific. A user agent should be capable of matching arbitrary spoken input up to the next token but may treat GARBAGE as equivalent to NULL (match no spoken input).` – Octavianus Nov 03 '11 at 14:19

In principle, you need to update either the grammar or the dictionary to have "empty" or "anything" entries there.
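One way to approximate an "anything" entry with System.Speech is to load a DictationGrammar alongside the command grammar and ignore results that come from it; this is a sketch under that assumption, and the "background" name is made up for illustration:

    DictationGrammar dictation = new DictationGrammar();
    dictation.Name = "background";
    recognizer.LoadGrammar(dictation);

    // In the SpeechRecognized handler:
    // if (e.Result.Grammar.Name == "background") return; // background speech, not a command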

Dims

Is it possible that you just need to run UnloadAllGrammars() prior to creating/loading the grammar that you want to use?
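As a sketch against the question's code (recognizer and builder from there):

    recognizer.UnloadAllGrammars();               // clear any previously loaded grammars
    recognizer.LoadGrammar(new Grammar(builder)); // then load only the one you want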

Ed.