I am working on a speech recognition project and have used a deep neural network to do the recognition. I also need the start and end times of the words occurring in the given speech. Can you suggest resources for solving the problem of timestamp generation in speech recognition? I know the Amazon Transcribe service generates timestamps too, but I haven't been able to find any papers about it.
1 Answer
If you're interested in trying Microsoft's speech service (https://aka.ms/speech/sdk), we support word-level timestamps as well. You can start with one of our quickstart samples (available in many programming languages).
Basically, after trying out the default microphone quickstart or file quickstart, you add a couple of lines of code to request the word-level timestamps, and one more line to retrieve the JSON response provided by the service (which contains the word-level timing information).
For example, in C#, you'd do this for your SpeechConfig object:
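// Ask for the detailed result format rather than the simple one.
config.OutputFormat = OutputFormat.Detailed;
// Request per-word timing; in the C# SDK this is a method call, not a property.
config.RequestWordLevelTimestamps();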
And once you've received your SpeechRecognitionResult object, you'd do this:
var json = result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
Console.WriteLine(json);
If you're using another supported programming language (C++, Java, JavaScript, Objective-C, Swift, Python, etc.), the code would be slightly different.
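Whichever language you use, the timing information arrives in that JSON string, so you'll need to parse it yourself. Here is a rough sketch in Python of turning it into start and end times in seconds. It assumes the detailed format puts the hypotheses under "NBest", each with a "Words" list whose entries carry "Word", "Offset" and "Duration" in 100-nanosecond units. Please check those field names against the response you actually get. The word_timings helper and the sample payload are made up for illustration and are not real service output.

```python
import json

# Assumption: Offset and Duration are expressed in 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def word_timings(json_text):
    """Return (word, start_seconds, end_seconds) for the first NBest entry."""
    response = json.loads(json_text)
    nbest = response.get("NBest") or []
    if not nbest:
        return []  # no hypotheses, so no word timings
    timings = []
    for entry in nbest[0].get("Words", []):
        start = entry["Offset"] / TICKS_PER_SECOND
        end = (entry["Offset"] + entry["Duration"]) / TICKS_PER_SECOND
        timings.append((entry["Word"], start, end))
    return timings


# Hand-written sample payload, for illustration only.
sample = json.dumps({
    "NBest": [{
        "Lexical": "hello world",
        "Words": [
            {"Word": "hello", "Offset": 5000000, "Duration": 4000000},
            {"Word": "world", "Offset": 9500000, "Duration": 5000000},
        ],
    }]
})

for word, start, end in word_timings(sample):
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

In a real program you would pass in the JSON string read from the result's properties instead of the sample.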
Good luck.
Rob Chambers, Microsoft
Architect and Engineering Manager
