I'm using the WPF speech recognition library, trying to use it in a desktop app as an alternative to menu commands. (I want to focus on the tablet experience, where you don't have a keyboard.) It works, sort of, except that the accuracy of recognition is so bad it's unusable. So I tried dictating into Word. Word worked reasonably well. I'm using my built-in laptop microphone in both cases, and both programs are capable of hearing the same speech simultaneously (provided Word retains keyboard focus), but Word gets it right and WPF does an abysmal job.

I've tried both a generic DictationGrammar() and a tiny specialised grammar, and I've tried both "en-US" and "en-AU"; in all cases Word performs well and WPF performs poorly. Even comparing the specialised grammar in WPF to the general grammar in Word, WPF gets it wrong 50% of the time, e.g. hearing "size small" as "color small".

    // Requires: using System.Speech.Recognition; (assembly System.Speech)
    private SpeechRecognitionEngine recognizer;
    private const bool UseCommandGrammar = false; // toggle between the two grammars under test

    private void InitSpeechRecognition()
    {
        recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

        // Create and load a grammar.
        if (UseCommandGrammar)
        {
            // A tiny command grammar: "<command> <value>", e.g. "size small".
            GrammarBuilder grammarBuilder = new GrammarBuilder();
            Choices commandChoices = new Choices("weight", "color", "size");
            grammarBuilder.Append(commandChoices);
            Choices valueChoices = new Choices();
            valueChoices.Add("normal", "bold");
            valueChoices.Add("red", "green", "blue");
            valueChoices.Add("small", "medium", "large");
            grammarBuilder.Append(valueChoices);
            recognizer.LoadGrammar(new Grammar(grammarBuilder));
        }
        else
        {
            // Free-form dictation.
            recognizer.LoadGrammar(new DictationGrammar());
        }

        // Add a handler for the speech recognized event.
        recognizer.SpeechRecognized += recognizer_SpeechRecognized;

        // Configure input to the speech recognizer.
        recognizer.SetInputToDefaultAudioDevice();

        // Start asynchronous, continuous speech recognition.
        recognizer.RecognizeAsync(RecognizeMode.Multiple);
    }
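
The SpeechRecognized handler wired above isn't shown in the post; a minimal sketch of what it might look like (the name matches the code above, the body is purely illustrative):

    private void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // e.Result.Text is the recognized phrase; e.Result.Confidence ranges from 0.0 to 1.0.
        System.Diagnostics.Debug.WriteLine(
            "Recognized: " + e.Result.Text + " (confidence " + e.Result.Confidence + ")");
    }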

Sample results from Word:

Hello 
make it darker 
I want a brighter colour 
make it reader 
make it greener 
thank you 
make it bluer 
make it more blue
make it darker 
turn on debugging 
turn off debugging 
zoom in 
zoom out 

The same audio in WPF, dictation grammar:

a lower
make it back
when Ted Brach
making reader
and he
liked the
ethanol and
act out
to be putting
it off the parking
zoom in
and out

I got the assembly using NuGet. I'm using runtime version v4.0.30319 and assembly version 4.0.0.0. If I'm supposed to "train" it, the documentation doesn't explain how, and I don't know whether any training is shared with other programs such as Word, or where the training is saved. I've been playing around with it long enough now for it to know the sound of my voice.

Can anyone tell me what I'm doing wrong?

Tim Cooper
  • You need to configure the topic constraint type and the constraint probability. The topic constraint defines in what "style" the speech recognizer will recognise the speech, and the constraint probability defines what degree of accuracy is desired (min, normal, max). Check the update in my answer below. – teodor mihail Apr 01 '23 at 23:37

5 Answers


Anyone who needs to use a speech recognition engine with the accuracy of Cortana (because it is using Cortana's speech recognition engine) should follow these steps.

Step 1) Install the NuGet package Microsoft.Windows.SDK.Contracts

Step 2) Migrate your project from packages.config to PackageReference --> https://devblogs.microsoft.com/nuget/migrate-packages-config-to-package-reference/

The above-mentioned SDK exposes the Windows 10 speech recognition system to Win32 apps. This matters because otherwise the only way to use this speech recognition engine is to build a Universal Windows Platform application, and I don't recommend building an A.I. application on UWP because of its sandboxing: the sandbox isolates the app in a container, won't let it communicate with most hardware, makes file access an absolute pain, and rules out manual thread management (only async functions are available).
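
For orientation, after the migration the project file carries the SDK as a PackageReference; a sketch (the version shown is only an example, take whatever NuGet resolves):

    <!-- .csproj of the Win32 (.NET Framework) desktop app -->
    <ItemGroup>
      <PackageReference Include="Microsoft.Windows.SDK.Contracts" Version="10.0.19041.1" />
    </ItemGroup>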

Step 3) Add this using directive. The namespace contains all the types related to online speech recognition.

using Windows.Media.SpeechRecognition;

Step 4) Add the speech recognition implementation.

Task.Run(async () =>
{
    try
    {
        var speech = new SpeechRecognizer();
        await speech.CompileConstraintsAsync();
        SpeechRecognitionResult result = await speech.RecognizeAsync();

        // We're on a background thread here, so marshal the UI update
        // back to the dispatcher thread.
        await Dispatcher.InvokeAsync(() => TextBox1.Text = result.Text);
    }
    catch
    {
        // Swallowing everything is for brevity only; at minimum log the exception.
    }
});

The majority of the methods in the Windows 10 SpeechRecognizer class must be called asynchronously, which means you must run them within an asynchronous method or an asynchronous Task method.
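
Alternatively, if the call site is already on the UI thread, an async event handler avoids the background thread (and the dispatcher hop) entirely; a sketch assuming a hypothetical button named RecognizeButton:

    private async void RecognizeButton_Click(object sender, RoutedEventArgs e)
    {
        // The awaits yield to the UI thread instead of blocking it.
        var speech = new SpeechRecognizer();
        await speech.CompileConstraintsAsync();
        SpeechRecognitionResult result = await speech.RecognizeAsync();

        // Still on the UI thread here, so this assignment is safe.
        TextBox1.Text = result.Text;
    }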

In order for this to work, go to Settings -> Privacy -> Speech in the OS and check that online speech recognition is allowed.

[ EDIT ]

In order for the speech recognizer to have a high degree of accuracy, a constraint should be set on the speech recognizer object. The constraint consists of two properties: the topic constraint and the constraint probability. The topic constraint defines the style in which the speech recognizer approaches the speech-to-text conversion ("dictation", "web search", "form filling", etc.). The one I found to have the highest degree of accuracy and utility is the "web search" topic constraint. The constraint probability defines the level of accuracy desired from the speech recognition session.

An example of an accurate speech recognition engine implementation:

// Assumes: using Windows.Media.SpeechRecognition; (added in Step 3)
var webSearchConstraint = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.WebSearch, "web search");
webSearchConstraint.Probability = SpeechRecognitionConstraintProbability.Max;

using (var onlineSpeechRecognition = new SpeechRecognizer())
{
    onlineSpeechRecognition.Constraints.Add(webSearchConstraint);
    SpeechRecognitionCompilationResult compilationResult = await onlineSpeechRecognition.CompileConstraintsAsync();

    // Lengthen the timeouts so the session doesn't end prematurely.
    onlineSpeechRecognition.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(9);
    onlineSpeechRecognition.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(9);
    onlineSpeechRecognition.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(9);

    SpeechRecognitionResult result = await onlineSpeechRecognition.RecognizeAsync();
    Console.WriteLine("You spoke: " + result.Text);
}

A speech recognizer with this configuration will have a degree of accuracy similar or identical to the speech recognizer in Microsoft Word. For an example implementation of the online speech recognition engine, see: https://github.com/CSharpTeoMan911/Eva

teodor mihail

This is expected. Word's dictation uses a cloud-based, AI/ML-assisted speech service: Azure Cognitive Services - Speech To Text. It is constantly trained and updated for the best accuracy. You can easily verify this by going offline and trying the dictation feature in Word: it won't work.
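
If you want that same cloud engine in your own app, it is available through the Azure Speech SDK (NuGet package Microsoft.CognitiveServices.Speech); a minimal sketch, assuming you have created an Azure Speech resource and substituting your own key and region:

    using Microsoft.CognitiveServices.Speech;

    // "your-key" and "your-region" are placeholders for your Azure Speech resource.
    var config = SpeechConfig.FromSubscription("your-key", "your-region");

    using (var azureRecognizer = new SpeechRecognizer(config))
    {
        // Recognizes a single utterance from the default microphone.
        SpeechRecognitionResult result = await azureRecognizer.RecognizeOnceAsync();
        Console.WriteLine(result.Text);
    }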

.NET's System.Speech uses the offline SAPI5, which hasn't been updated since Windows 7 as far as I'm aware. The core technology itself (Windows 95 era) is much older than what is available on today's phones or cloud-based services. Microsoft.Speech.Recognition uses a similar core and won't be much better, although you can give it a try.

If you want to explore other offline options, I would suggest trying Windows.Media.SpeechRecognition. As far as I'm aware, it is the same technology as used by Cortana and other modern voice recognition apps on Windows 8 and up and does not use SAPI5.

It's pretty easy to find examples for Azure or Windows.Media.SpeechRecognition online; the best way to use the latter is to update your app to .NET 5 and use C#/WinRT to access the UWP APIs, as sketched below.
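
A minimal sketch of that route, assuming a .NET 5 project whose TargetFramework is a Windows 10 TFM (which exposes the WinRT APIs through the C#/WinRT projections):

    // In the .csproj: <TargetFramework>net5.0-windows10.0.19041.0</TargetFramework>
    // That target makes Windows.Media.SpeechRecognition available without extra packages.
    using System;
    using Windows.Media.SpeechRecognition;

    // C# 9 top-level statements; await works directly here.
    using var recognizer = new SpeechRecognizer();
    await recognizer.CompileConstraintsAsync();
    SpeechRecognitionResult result = await recognizer.RecognizeAsync();
    Console.WriteLine(result.Text);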

Prajay Basu
  • Thanks. I've tried out Azure Cognitive Services. It was a bit of a hassle to set up but seems to be working well. I'd prefer something that works offline, but not if the quality is poor. – Tim Cooper Aug 08 '21 at 04:44
  • Is there a way to make Windows.Media.SpeechRecognition work offline for continuous dictation? The sample from Windows requires an internet connection. – Ali123 Aug 11 '21 at 10:37

Your best bet, I would say, is to use not a DictationGrammar but specific grammars with whole phrases or with key-value assignments:

// Requires: using System.Globalization; and using System.Speech.Recognition;
private static SpeechRecognitionEngine CreateRecognitionEngine()
{
    var cultureInf = new CultureInfo("en-US");

    var recoEngine = new SpeechRecognitionEngine(cultureInf);
    recoEngine.SetInputToDefaultAudioDevice();

    // One key-value grammar per command: "weight bold", "color red", "size small", ...
    recoEngine.LoadGrammar(CreateKeyValuesGrammar(cultureInf, "weight", new string[] { "normal", "bold", "demibold" }));
    recoEngine.LoadGrammar(CreateKeyValuesGrammar(cultureInf, "color", new string[] { "red", "green", "blue" }));
    recoEngine.LoadGrammar(CreateKeyValuesGrammar(cultureInf, "size", new string[] { "small", "medium", "large" }));

    // An empty key makes a grammar of standalone whole phrases.
    recoEngine.LoadGrammar(CreateKeyValuesGrammar(cultureInf, "", new string[] { "Put whole phrase here", "Put whole phrase here again", "another long phrase" }));

    return recoEngine;
}

static Grammar CreateKeyValuesGrammar(CultureInfo cultureInf, string key, string[] values)
{
    // With a key, the grammar matches "<key> <value>"; without one, just "<value>".
    var grBldr = string.IsNullOrWhiteSpace(key) ? new GrammarBuilder() { Culture = cultureInf } : new GrammarBuilder(key) { Culture = cultureInf };
    grBldr.Append(new Choices(values));

    return new Grammar(grBldr);
}
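
A minimal usage sketch for the engine above (the handler wiring is illustrative):

    var recoEngine = CreateRecognitionEngine();
    recoEngine.SpeechRecognized += (s, e) =>
        Console.WriteLine("Heard: " + e.Result.Text + " (confidence " + e.Result.Confidence + ")");
    recoEngine.RecognizeAsync(RecognizeMode.Multiple);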

You may also try Microsoft.Speech.Recognition; see What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?

Rekshino

A simple solution would be to use the Dictate function available in Word (Office 365). All the other functionality, such as grammar and language handling, is taken care of by the Dictate function.

To access the Dictate function in Word (Office 365), use the code below (VBA):


Application.CommandBars.ExecuteMso("Dictate")
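
If you would rather trigger this from C# than VBA, the same command can be fired through the Word interop assembly; a sketch, assuming a reference to Microsoft.Office.Interop.Word:

    using Word = Microsoft.Office.Interop.Word;

    // Starts Word, adds a blank document, and fires the built-in Dictate command.
    var wordApp = new Word.Application { Visible = true };
    wordApp.Documents.Add();
    wordApp.CommandBars.ExecuteMso("Dictate");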

Kumaresh

As you are actually creating a voice user interface and not just doing speech recognition, you should check out Speechly. With Speechly it's a lot easier to create natural experiences that don't require hard-coded commands but instead support multiple ways of expressing the same thing. Integrating it into your application should be pretty simple, too. There's a small CodePen on the front page to get a basic understanding.

ottomatias