I would like to make a speech-to-text-to-translation utility that can recognize and render both English and Spanish "on the fly." To start with, I just need it to recognize both languages (I'll postpone the translation piece until later).
In other words, I want it to be able to process (through the device's microphone) conversations such as:
Spanish speaker's voice captured and rendered: "¿Qué estás haciendo?"
English speaker's voice captured and rendered: "I don't speak Spanish, or Italian, or whatever lingo that is. Speak English!"
Spanish speaker: "I asked you what you're doing."
English speaker: "Oh, not much really; I mean, none of your gol-durned business!"
(etc.)
I see here that I can set up a speech-to-text session like so:
using System;
using System.Globalization;   // needed for CultureInfo
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;

namespace ConsoleSpeech
{
    class ConsoleSpeechProgram
    {
        static SpeechSynthesizer ss = new SpeechSynthesizer();
        static SpeechRecognitionEngine sre;

        static void Main(string[] args)
        {
            try
            {
                // Recognition engine tied to a single culture/language.
                CultureInfo ci = new CultureInfo("en-us");
                sre = new SpeechRecognitionEngine(ci);
                sre.SetInputToDefaultAudioDevice();
                sre.SpeechRecognized += sre_SpeechRecognized;
                . . .

        static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string txt = e.Result.Text;
            float confidence = e.Result.Confidence;
            Console.WriteLine("\nRecognized: " + txt);
            if (confidence < 0.60) return;
            . . .
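Picking up where the Main() excerpt leaves off, I assume the rest of the setup looks roughly like this (the grammar phrases below are placeholders I made up, not anything from the sample I'm working from):

// Back in Main(), after wiring up the event handler.
// Microsoft.Speech (the server SDK) wants an explicit grammar,
// so build a trivial one just to get recognition running.
Choices phrases = new Choices("hello", "goodbye", "what are you doing");
GrammarBuilder gb = new GrammarBuilder(phrases);
gb.Culture = ci;                       // grammar culture must match the engine's culture
sre.LoadGrammar(new Grammar(gb));

// Listen continuously until the user presses Enter.
sre.RecognizeAsync(RecognizeMode.Multiple);
Console.ReadLine();
sre.RecognizeAsyncStop();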
Since the recognition engine is instantiated with a specific CultureInfo (US English shown above), I'm guessing it would render "¿Qué estás haciendo?" as something like "Kay is toss hossy end, oh?" and therefore produce a very low Result.Confidence value.
Is there a way to respond to two languages simultaneously, such as by instantiating two CultureInfo instances:
CultureInfo ciEnglish = new CultureInfo("en-us");
CultureInfo ciSpanish = new CultureInfo("es-mx");
Even if that is doable, would the two recognition engines be "willing" to share the microphone, and be smart enough to cede to the other when they don't understand what is being spoken?
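What I have in mind is something like the following: two engines, one per culture, both pointed at the default audio device and both feeding the same handler. This is only a sketch of the idea; the names (sreEnglish, sreSpanish, AnyLanguageRecognized) and the grammar phrases are mine, and I don't know whether two engines will actually tolerate sharing the microphone this way:

CultureInfo ciEnglish = new CultureInfo("en-us");
CultureInfo ciSpanish = new CultureInfo("es-mx");

SpeechRecognitionEngine sreEnglish = new SpeechRecognitionEngine(ciEnglish);
SpeechRecognitionEngine sreSpanish = new SpeechRecognitionEngine(ciSpanish);

// Both engines listen to the same default microphone -- this is the part
// I'm not sure is even allowed.
sreEnglish.SetInputToDefaultAudioDevice();
sreSpanish.SetInputToDefaultAudioDevice();

// Each engine gets a grammar in its own language (placeholder phrases).
GrammarBuilder gbEn = new GrammarBuilder(new Choices("what are you doing", "speak English"));
gbEn.Culture = ciEnglish;
sreEnglish.LoadGrammar(new Grammar(gbEn));

GrammarBuilder gbEs = new GrammarBuilder(new Choices("qué estás haciendo", "no hablo inglés"));
gbEs.Culture = ciSpanish;
sreSpanish.LoadGrammar(new Grammar(gbEs));

// Same handler for both engines, so results can be compared in one place.
sreEnglish.SpeechRecognized += AnyLanguageRecognized;
sreSpanish.SpeechRecognized += AnyLanguageRecognized;

sreEnglish.RecognizeAsync(RecognizeMode.Multiple);
sreSpanish.RecognizeAsync(RecognizeMode.Multiple);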
I'm afraid this is going to be one of those "hard" (read: "pretty much impossible") challenges. If I'm wrong about that, though, please let me know.
In the answer by Bulltorious here, it looks as though the SpeechRecognized handler might be able to determine which language is being spoken, but not enough code is shown for me to tell whether that is really the case.
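If the two-engine arrangement above works at all, my guess is that the shared handler is where the language could be determined: each engine reports its own culture (via RecognizerInfo, assuming Microsoft.Speech exposes that the same way System.Speech does), and low-confidence results from the "wrong" engine could simply be ignored. Again, this is only a guess at how it might look (the 0.60 threshold is the same arbitrary cutoff from the sample above):

static void AnyLanguageRecognized(object sender, SpeechRecognizedEventArgs e)
{
    // The sender tells us which engine (and therefore which language) fired.
    SpeechRecognitionEngine engine = (SpeechRecognitionEngine)sender;
    string language = engine.RecognizerInfo.Culture.Name;   // "en-US" or "es-MX"

    // Presumably the engine for the "wrong" language reports low confidence.
    if (e.Result.Confidence < 0.60) return;

    Console.WriteLine("\n[{0}] Recognized: {1} (confidence {2:F2})",
                      language, e.Result.Text, e.Result.Confidence);
}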