I'm trying to build a tool with Azure Text-to-Speech that reliably pronounces a user's name correctly when speaking the provided text.
The approach I came up with is:
- Have the user speak their name within the context of a few sentences
- Have Azure extract the phonemes of the name, specifically, from those spoken words
- Add those extracted phonemes back into the SSML for the text-to-speech step later on (see the sketch just after this list)
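For that last step, what I have in mind is the standard SSML <phoneme> element. A minimal sketch of what I'm picturing, assuming the same SpeechConfig (`config`) as below and using a placeholder name, voice, and IPA string (the real ph value would come from whatever phonemes I manage to extract):

var ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    Welcome back,
    <phoneme alphabet='ipa' ph='ʃɪˈvɔːn'>Siobhan</phoneme>.
  </voice>
</speak>";

// Speak the SSML with the phoneme override applied to the name.
using (var synthesizer = new Microsoft.CognitiveServices.Speech.SpeechSynthesizer(config))
{
    await synthesizer.SpeakSsmlAsync(ssml);
}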
Looking at the docs, it seems the only place Azure gives you back phonemes is via the PronunciationAssessmentResult
class, so I wrote the following code to try and extract them:
using (var recognizer = new Microsoft.CognitiveServices.Speech.SpeechRecognizer(config))
{
    // Attach a pronunciation assessment using the reference text the user was asked to read.
    var pronunciationAssessmentConfig = new PronunciationAssessmentConfig(requestedStatement);
    pronunciationAssessmentConfig.ApplyTo(recognizer);

    var reply = await recognizer.RecognizeOnceAsync();
    var result = PronunciationAssessmentResult.FromResult(reply);

    foreach (var word in result.Words)
    {
        // Here I can add in code to look at word.Word / word.Phonemes
        // (each entry in word.Phonemes has a Phoneme string and an AccuracyScore).
    }
}
The problem I'm having is that PronunciationAssessmentResult.FromResult(reply)
returns the phonemes Azure itself generates from the reference statement, NOT the ones extracted from the spoken audio.
(In other words, it creates the phonemes it expects, then grades the spoken words on how close the pronunciation is to those expected phonemes.) That is not what I'm looking to do here.
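One way to see what the assessment is actually grading against is to dump the detailed JSON attached to the recognition result (just a diagnostic sketch; the per-word phoneme entries in there line up with the reference text rather than with what was spoken):

// Print the raw assessment JSON returned alongside the recognition result.
var json = reply.Properties.GetProperty(
    Microsoft.CognitiveServices.Speech.PropertyId.SpeechServiceResponse_JsonResult);
Console.WriteLine(json);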
I did find a workaround using System.Speech.Recognition
(based on answers given here to a question posed in 2013), but that limits me to Windows machines and, I'm assuming, to a less accurate speech engine than the ones currently available. A rough sketch of that workaround is below.
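This is roughly the shape of it (a sketch only, not necessarily the exact approach from that 2013 answer; it assumes the desktop System.Speech assembly and an installed en-US recognizer, and reads the matched phones from RecognizedWordUnit.Pronunciation):

using System.Speech.Recognition;

using (var engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US")))
{
    engine.LoadGrammar(new DictationGrammar());
    engine.SetInputToDefaultAudioDevice();

    var recognized = engine.Recognize();
    foreach (RecognizedWordUnit word in recognized.Words)
    {
        // Pronunciation holds the phones the engine matched for the spoken word.
        Console.WriteLine($"{word.Text}: {word.Pronunciation}");
    }
}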
Is there a way to do this using Azure, Google, or some other well-developed speech engine?
ANY sample code would really be helpful!