I am currently developing a DialerService project. One of its functions is to transcribe recorded .wav media files into plain text. I used the SpeechRecognitionEngine to transcribe the contents, and the results are inaccurate or, sometimes, broken sentences that don't make any sense.

The .wav files are recordings of telephone conversations between two or more clients. The file I tested is a very simple and short conversation I made with my colleague.

So my question is: how can I improve the accuracy of the transcription, and what should I change in my code for this purpose? I know that adding a grammar will help recognize some keywords, but what I need is to transcribe the general content that I record from users.

Below is my working code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading.Tasks;
using System.Speech.Recognition;
using System.Speech.AudioFormat;
using System.Web;

namespace VoiceRecognition
{
    class Program
    {

        static bool completed;

        static void Main(string[] args)
        {
            using (
             SpeechRecognitionEngine recognizer =
                    new SpeechRecognitionEngine(
                        new System.Globalization.CultureInfo("en-US")))
            {

                // Create and load a grammar.
                Grammar dictation = new DictationGrammar();
                dictation.Name = "Dictation Grammar";

                recognizer.LoadGrammar(dictation);

                recognizer.SetInputToWaveFile(@"C:\Projects2\VoiceRecognition2\conf_with_vincent_1.wav");
                // Attach event handlers for the results of recognition.
                //recognizer.AudioLevelUpdated += new EventHandler<AudioLevelUpdatedEventArgs>(recognizer_AudioLevelUpdated);
                //recognizer.AudioStateChanged += new EventHandler<AudioStateChangedEventArgs>(recognizer_AudioStateChanged);

                recognizer.SpeechRecognized  +=  new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
                recognizer.RecognizeCompleted += new EventHandler<RecognizeCompletedEventArgs>(recognizer_RecognizeCompleted);

                // Perform recognition on the entire file.
                Console.WriteLine("Starting asynchronous recognition...");
                completed = false;
                //recognizer.RecognizeAsync();
                recognizer.RecognizeAsync(RecognizeMode.Multiple);

                // Keep the console window open.
                while (!completed)
                {
                    Console.ReadLine();
                }
                Console.WriteLine("Done.");
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to exit...");
            Console.ReadKey();


        }

        // Handle the Audio state event.
        static void recognizer_AudioStateChanged(object sender, AudioStateChangedEventArgs e)
        {
            Console.WriteLine("The new audio state is: " + e.AudioState);
        }

        static void recognizer_AudioLevelUpdated(object sender, AudioLevelUpdatedEventArgs e)
        {
            Console.WriteLine("The audio level is now: {0}.", e.AudioLevel);
        }


        // Handle the SpeechRecognized event.
        static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            if (e.Result != null && e.Result.Text != null)
            {
                Console.WriteLine("  Recognized text =  {0}", e.Result.Text);
            }
            else
            {
                Console.WriteLine("  Recognized text not available.");
            }
        }

        // Handle the RecognizeCompleted event.
        static void recognizer_RecognizeCompleted(object sender, RecognizeCompletedEventArgs e)
        {
            if (e.Error != null)
            {
                Console.WriteLine("  Error encountered, {0}: {1}",
                e.Error.GetType().Name, e.Error.Message);
            }
            if (e.Cancelled)
            {
                Console.WriteLine("  Operation cancelled.");
            }
            if (e.InputStreamEnded)
            {
                Console.WriteLine("  End of stream encountered.");
            }
            completed = true;
        }



    }
}

Another class is:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Speech.Recognition;

public class SpeechReconizer
{

    SpeechRecognitionEngine _speechRecognitionEngine;
    public SpeechReconitionResult ReadResult { get; set; }

    public SpeechReconizer()
    {
        Grammar dictation = new DictationGrammar();
        dictation.Name = "Dictation Grammar";

        _speechRecognitionEngine = new SpeechRecognitionEngine();
        _speechRecognitionEngine.LoadGrammar(dictation);
        _speechRecognitionEngine.InitialSilenceTimeout = TimeSpan.FromSeconds(3);
        _speechRecognitionEngine.BabbleTimeout = TimeSpan.FromSeconds(2);
        _speechRecognitionEngine.EndSilenceTimeout = TimeSpan.FromSeconds(1);
        _speechRecognitionEngine.EndSilenceTimeoutAmbiguous = TimeSpan.FromSeconds(1.5);

        // Subscribe before any recognition starts. The input is set per call in
        // ReadSpeech, so no audio device is bound and no recognition is started here
        // (changing the input while an asynchronous recognition is running fails).
        _speechRecognitionEngine.SpeechRecognized += RecognizerSpeechRecognized;
        _speechRecognitionEngine.RecognizeCompleted += RecognizerRecognizeCompleted;
    }



    public SpeechReconitionResult ReadSpeech(string sourceAudio)
    {
        ReadResult = new SpeechReconitionResult();

        _speechRecognitionEngine.SetInputToWaveFile(sourceAudio);


        _speechRecognitionEngine.Recognize();
        return ReadResult;

    }

    private void RecognizerSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        if (e.Result != null && e.Result.Text != null)
        {
            ReadResult.Success = true;
            ReadResult.Text = e.Result.Text;
        }
        else
        {
            ReadResult.Text = "Recognized text not available.";
        }
    }

    private void RecognizerRecognizeCompleted(object sender, RecognizeCompletedEventArgs e)
    {
        if (e.Error != null)
        {
            ReadResult.Success = false;
            ReadResult.ErrorMessage = string.Format("{0}: {1}",
                          e.Error.GetType().Name, e.Error.Message);
        }
        if (e.Cancelled)
        {
            ReadResult.Success = false;
            ReadResult.ErrorMessage = "Operation cancelled.";
        }
    }

}
public class SpeechReconitionResult
{
    public string Text { get; set; }
    public bool Success { get; set; }
    public string ErrorMessage { get; set; }
    public bool Complete { get; set; }
}

The test result (in the console) is:

Starting asynchronous recognition...
  Recognized text = Helence and the globe or east
  Recognized text = alarmed
  Recognized text = and client thanks
  Recognized text = what aren't going to do and that they
  Recognized text = aren't goint to rule
  Recognized text = working to dear E
  Recognized text = N
  Recognized text = at dinner
  Recognized text = and
  Recognized text = that going there
  Recognized text = and you have a 98 no problem bars
  End of stream encountered.

The actual content is: -Hello Vincent. -Hello Boris. -How are you? -I am fine. -What are you going to do today? -I am going to watch TV, have dinner and go home. -Thank you, have a nice day. -No problem.

xiaoy23

2 Answers

System.Speech.Recognition powers the default Windows speech recognition. It is designed for a single user and can be trained by that user through the Windows speech recognition training.

What you probably want is the Microsoft.Speech.Recognition library, which is designed for lower-quality audio. It works almost the same way; however, it is not designed for dictation. It is made more for detecting commands in telephone-quality audio. If you would like to give it a shot, the latest version I found is here: http://www.microsoft.com/en-us/download/details.aspx?id=27226
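As a rough illustration of the command-detection style described above, here is a minimal sketch that spots a few keywords in a .wav file. It assumes the Microsoft Speech Platform runtime, SDK and an en-US language pack are installed and that the project references Microsoft.Speech.dll. The class name CommandSpotter, the method Run and the keyword list are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Globalization;
using System.Threading;
using Microsoft.Speech.Recognition; // assumption: Speech Platform SDK is installed and referenced

public static class CommandSpotter
{
    // Recognizes a small fixed set of phrases in a wave file and prints each hit.
    public static void Run(string wavPath)
    {
        CultureInfo culture = new CultureInfo("en-US");

        using (SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine(culture))
        using (ManualResetEvent done = new ManualResetEvent(false))
        {
            // Hypothetical keywords; replace with the phrases you need to detect.
            Choices keywords = new Choices("hello", "thank you", "no problem", "dinner");

            // The grammar culture has to match the culture of the recognizer.
            GrammarBuilder builder = new GrammarBuilder(keywords);
            builder.Culture = culture;
            recognizer.LoadGrammar(new Grammar(builder));

            // Print every recognized phrase together with its confidence score.
            recognizer.SpeechRecognized += (sender, e) =>
                Console.WriteLine("{0} (confidence {1:F2})", e.Result.Text, e.Result.Confidence);

            // Signal the waiting thread once the whole file has been processed.
            recognizer.RecognizeCompleted += (sender, e) => done.Set();

            recognizer.SetInputToWaveFile(wavPath);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            done.WaitOne();
        }
    }
}
```

Calling CommandSpotter.Run with the path of a recording would print only phrases from the keyword list, so this fits keyword detection rather than free-form transcription.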

Krikor Ailanjian
  • Thank you very much Kailanjian! So it can detect keywords instead of a long conversation, correct? If so, how should I implement such a function, and what other libraries can I use to get a more accurate transcription? I wanted to use the Google Speech Recognition API but it seems to be restricted. This is the first time I'm trying to build such a function, please advise. – xiaoy23 Jul 30 '15 at 19:17
  • Glad to help! Unfortunately, this is as far as I can help you. I only know about this because I made a small speech project for myself using the Windows speech recognition. I did find another Stack Overflow answer that might be able to help you: http://stackoverflow.com/questions/12721436/google-speech-api. Good luck – Krikor Ailanjian Jul 30 '15 at 23:56

There are actually several approaches to this that I have tested using C#. One of them is pronunciation handling, for example via the SrgsToken.Pronunciation property or by building a "slang rule" from SrgsOneOf objects. Say you have a command such as "abandon", but the person pronounces it "abanon". You can list the spoken variants: SrgsOneOf abandon = new SrgsOneOf(new string[] { "abandan", "abandin", "ah'bandon", "abanon" });

Then create the "slang rule" from the SrgsOneOf object: SrgsRule slangRule = new SrgsRule("slang", abandon);
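To show how such a rule could end up in a loadable grammar, here is a minimal sketch using the System.Speech.Recognition.SrgsGrammar types. The class name SlangGrammarSketch and its method names are hypothetical, and adding the canonical spelling "abandon" to the variant list is an assumption made for the example.

```csharp
using System;
using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

public static class SlangGrammarSketch
{
    // Builds an SRGS document whose root rule accepts any of the listed variants.
    public static SrgsDocument BuildDocument()
    {
        // Alternatives the speaker might produce for the command "abandon".
        SrgsOneOf abandon = new SrgsOneOf(
            new string[] { "abandon", "abandan", "abandin", "abanon" });

        // The "slang rule" wraps the alternatives; public scope makes it usable as a root rule.
        SrgsRule slangRule = new SrgsRule("slang", abandon);
        slangRule.Scope = SrgsRuleScope.Public;

        SrgsDocument document = new SrgsDocument();
        document.Rules.Add(slangRule);
        document.Root = slangRule;
        return document;
    }

    // Wraps the document in a Grammar that a SpeechRecognitionEngine can load.
    public static Grammar BuildGrammar()
    {
        return new Grammar(BuildDocument());
    }
}
```

The resulting grammar would be passed to LoadGrammar on the recognizer, in place of or alongside other command grammars.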

Also, in C#, an enum (or enumeration type) is used to assign constant names to a group of integer values. It makes constant values more readable. I applied this to my switch-case statements, and since then the response handling has been more accurate and cleaner.
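The enum idea could look roughly like the following sketch, which maps recognized text to an enum value and then switches on it. The names VoiceCommand and CommandRouter, the phrases and the response strings are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Named constants for the commands the application understands.
public enum VoiceCommand
{
    Unknown,
    FindBiochemistry,
    Abandon
}

public static class CommandRouter
{
    // Maps recognized phrases to commands, ignoring case differences.
    private static readonly Dictionary<string, VoiceCommand> Phrases =
        new Dictionary<string, VoiceCommand>(StringComparer.OrdinalIgnoreCase)
        {
            { "scarlett find biochemistry", VoiceCommand.FindBiochemistry },
            { "abandon", VoiceCommand.Abandon }
        };

    // Returns the matching command, or Unknown when the text is not a known phrase.
    public static VoiceCommand Parse(string recognizedText)
    {
        VoiceCommand command;
        if (recognizedText != null && Phrases.TryGetValue(recognizedText.Trim(), out command))
        {
            return command;
        }
        return VoiceCommand.Unknown;
    }

    // Switching on the enum keeps each branch readable and free of string comparisons.
    public static string Describe(VoiceCommand command)
    {
        switch (command)
        {
            case VoiceCommand.FindBiochemistry:
                return "Searching for biochemistry";
            case VoiceCommand.Abandon:
                return "Abandoning current task";
            default:
                return "Command not understood";
        }
    }
}
```

In a SpeechRecognized handler, e.Result.Text would be passed to CommandRouter.Parse, so that misrecognized text falls through to the Unknown branch instead of a mistyped string comparison.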

Last: it's essential to practice good grammar building and to invest in a good-quality microphone. Another factor that many overlook is string usage and the memory it can consume. The major point in using constant strings (seen below) is that a constant string is automatically interned. If you have 1000 instances of a type with a regular string field, and every instance stores the same string that never changes, then 1000 equal string instances will be stored, unnecessarily inflating the memory profile of your application. If you declare the string as a constant, it will only consume memory once. This is the same behavior as using the string literal directly. In contrast to a static readonly string, the value of a constant string is stored directly in the referencing class.

Let's say I have these Choices:
Choices commands = new Choices(); commands.Add(new string[] { "scarlett find biochemistry" });

How do I take a smarter approach? Answer: declare a constant:
private const string V = "scarlett find biochemistry";
Now:
Choices commands = new Choices(); commands.Add(new string[] { V });
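The interning behavior mentioned above can be checked with a small sketch. The class InternSketch and its method names are hypothetical. It only compares object references, so it illustrates the memory sharing and nothing about the recognizer, which receives the same text either way.

```csharp
using System;

public static class InternSketch
{
    private const string V = "scarlett find biochemistry";

    // A constant and an identical literal refer to the same interned string instance.
    public static bool ConstSharesLiteralInstance()
    {
        string literal = "scarlett find biochemistry";
        return object.ReferenceEquals(V, literal);
    }

    // A string built at run time is a separate object with equal content;
    // string.Intern maps it back to the shared instance.
    public static bool RuntimeStringIsSeparateInstance()
    {
        string built = new string(V.ToCharArray());
        bool separate = !object.ReferenceEquals(V, built);
        bool internedIsShared = object.ReferenceEquals(V, string.Intern(built));
        return separate && internedIsShared;
    }
}
```

Both methods are expected to return true, which matches the statement that a constant behaves like the string literal used directly.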

This has greatly improved my system, which can be seen on YouTube: Scarlett Extreme