I'm trying to build a tool with Azure Text-to-Speech that reliably pronounces a user's name correctly when speaking the provided text.
The approach I came up with is:
- Have the user speak their name within the context of a few sentences
- Have Azure extract the phonemes of the name, specifically, from those spoken words
- Add those extracted phonemes back into the SSML for the text-to-speech step later on (see the sketch just after this list)
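For that last step, what I have in mind is the standard SSML <phoneme> element. A minimal sketch of what I'm picturing, assuming the same SpeechConfig (`config`) as below and using a placeholder name, voice, and IPA string (the real ph value would come from whatever phonemes I manage to extract):

var ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    Welcome back,
    <phoneme alphabet='ipa' ph='ʃɪˈvɔːn'>Siobhan</phoneme>.
  </voice>
</speak>";

// Speak the SSML with the phoneme override applied to the name.
using (var synthesizer = new Microsoft.CognitiveServices.Speech.SpeechSynthesizer(config))
{
    await synthesizer.SpeakSsmlAsync(ssml);
}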
Looking at the docs, it seems the only place Azure gives you back phonemes is via the PronunciationAssessmentResult
class, so I wrote the following code to try and extract them:
using (var recognizer = new Microsoft.CognitiveServices.Speech.SpeechRecognizer(config))
{
    // Attach a pronunciation assessment using the reference text the user was asked to read.
    var pronunciationAssessmentConfig = new PronunciationAssessmentConfig(requestedStatement);
    pronunciationAssessmentConfig.ApplyTo(recognizer);

    var reply = await recognizer.RecognizeOnceAsync();
    var result = PronunciationAssessmentResult.FromResult(reply);

    foreach (var word in result.Words)
    {
        // Here I can add in code to look at word.Word / word.Phonemes
        // (each entry in word.Phonemes has a Phoneme string and an AccuracyScore).
    }
}
The problem I'm having is that PronunciationAssessmentResult.FromResult(reply)
returns the phonemes Azure itself generates from the reference statement, NOT the ones extracted from the spoken audio.
(In other words, it creates the phonemes it expects, then grades the spoken words on how close the pronunciation is to those expected phonemes.) That is not what I'm looking to do here.
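One way to see what the assessment is actually grading against is to dump the detailed JSON attached to the recognition result (just a diagnostic sketch; the per-word phoneme entries in there line up with the reference text rather than with what was spoken):

// Print the raw assessment JSON returned alongside the recognition result.
var json = reply.Properties.GetProperty(
    Microsoft.CognitiveServices.Speech.PropertyId.SpeechServiceResponse_JsonResult);
Console.WriteLine(json);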
I did find a workaround using System.Speech.Recognition
(based on answers given here to a question posed in 2013), but that limits me to Windows machines and, I'm assuming, to a less accurate speech engine than the ones currently available. A rough sketch of that workaround is below.
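This is roughly the shape of it (a sketch only, not necessarily the exact approach from that 2013 answer; it assumes the desktop System.Speech assembly and an installed en-US recognizer, and reads the matched phones from RecognizedWordUnit.Pronunciation):

using System.Speech.Recognition;

using (var engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US")))
{
    engine.LoadGrammar(new DictationGrammar());
    engine.SetInputToDefaultAudioDevice();

    var recognized = engine.Recognize();
    foreach (RecognizedWordUnit word in recognized.Words)
    {
        // Pronunciation holds the phones the engine matched for the spoken word.
        Console.WriteLine($"{word.Text}: {word.Pronunciation}");
    }
}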
Is there a way to do this using Azure, Google, or some other well-developed speech engine?
ANY sample code would really be helpful!