Speech-to-text large audio files [Microsoft Speech API]

Question

What is the best way to transcribe medium/large audio files (~6-10 minutes each) using the Microsoft Speech API? Something like batch transcription of audio files?

I have used the code provided in https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-to-text-sample for continuously transcribing speech, but it stops transcribing at some point. Is there any restriction on the transcription? I am only using the free trial account at the moment.

Btw, I assume there is no difference between Bing Speech API and the new Speech service API, right?

Thanks everyone!

Could you share your code @Blue482? I would like to see it :-) — Beckenbaur93, Nov 25 '19 at 09:01

score 4 · Accepted Answer · answered Jun 19 '18 at 18:05

Thank you for your feedback.

I agree the sample (and the documentation you are looking at) is not very clear; we will update this soon.

The sample uses RecognizeAsync, which should really be called RecognizeOnceAsync: it currently just returns the FIRST FinalResult from the service. For longer audio you should use Start/StopRecognizeAsync and register to receive Result events.
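
A minimal sketch of that event-based pattern, in Python rather than the C# library discussed here. It assumes the `azure-cognitiveservices-speech` package and uses the method names of the current Python SDK, which differ from the preview names mentioned above. `collect_continuous`, `transcribe_file`, and the `key`/`region` arguments are illustrative names, not taken from the sample.

```python
import threading


def collect_continuous(recognizer):
    """Run continuous recognition and return every final phrase, in order."""
    phrases = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per final result; skip results with empty text.
        if evt.result.text:
            phrases.append(evt.result.text)

    recognizer.recognized.connect(on_recognized)
    # Either event means no further results will arrive.
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    done.wait()  # block until the whole file has been processed
    recognizer.stop_continuous_recognition()
    return phrases


def transcribe_file(path, key, region):
    """Transcribe one audio file (assumes the Speech SDK is installed)."""
    # Imported lazily so collect_continuous can be used without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
    return " ".join(collect_continuous(recognizer))
```

The point of the pattern is that results are gathered from events until the session ends, instead of returning after the first final result.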

Again, sorry for the bad documentation here. We will update it soon, and will probably also rename the API in a refresh.

If you have audio files, you could also use the batch transcription feature. Perhaps that helps? https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription

Cheers Wolfgang

Thanks Wolfgang! I have got it sorted :). Would you like to answer or comment on my other question: https://stackoverflow.com/questions/50822466/difference-among-microsoft-speech-products-platforms ? Thanks a lot! – Blue482, Jun 19 '18 at 19:24

score 1 · Answer 2 · answered Jun 11 '18 at 16:07

During the free trial, the Speech services allow 5,000 transactions per month and 20 per minute, so maybe at some point you exceed the 20-per-minute limit because of the real-time continuous recognition.
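
If the per-minute quota really is what is being hit (an assumption; nothing here confirms that it applies to continuous recognition), one option is to throttle requests on the client side. A rough sketch in Python, where `MinuteThrottle` is a hypothetical helper and not part of any Microsoft library:

```python
import time
from collections import deque


class MinuteThrottle:
    """Allow at most max_calls calls within any sliding window of seconds."""

    def __init__(self, max_calls=20, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def _drop_expired(self, now):
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.window - (now - self._calls[0]))
            now = self._clock()
            self._drop_expired(now)
        self._calls.append(now)
```

Calling `throttle.wait()` before each request keeps the client under 20 requests per minute with the defaults.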

Ali Heikal

Thanks Ali! Is there any limit on the length of the audio per transaction? Essentially I want to transcribe a 5-10 mins long audio recording. – Blue482 Jun 11 '18 at 16:43
You're welcome. The REST API requests may contain up to 10 seconds of audio and last a maximum of 14 seconds overall, and your total usage cannot exceed 5 hours per month on the free tier; beyond that you'll have to upgrade to a paid tier. – Ali Heikal Jun 11 '18 at 21:13
Thanks. But I am using the Client C# Libraries. Does it have any limit? – Blue482 Jun 11 '18 at 21:22
Can I also ask, Ali: do the Microsoft Speech Service API and the Bing API support multiple speakers? And does the output include timestamps or timecodes? Thanks again, appreciated. – Blue482 Jun 11 '18 at 21:24
It recognizes any phrase that is said in an audio file. If you want to identify the speakers by name, you'd have to use the Speaker Recognition API too. As for timestamps, you can handle that on your side: the transcription response contains `Offset`, which specifies the offset at which a phrase was recognized relative to the start of the audio stream, and `Duration`, which specifies the duration of that speech phrase. – Ali Heikal Jun 11 '18 at 21:56
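
A small sketch of turning those two fields into timecodes, in Python. It relies on `Offset` and `Duration` being expressed in 100-nanosecond units, as the service documentation describes them; `format_timestamp` and `phrase_span` are illustrative names.

```python
TICKS_PER_MILLISECOND = 10_000  # one tick is 100 nanoseconds


def format_timestamp(ticks):
    """Format a tick count as HH:MM:SS.mmm."""
    total_ms = ticks // TICKS_PER_MILLISECOND
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def phrase_span(offset, duration):
    """Return (start, end) timecodes for one recognized phrase."""
    return format_timestamp(offset), format_timestamp(offset + duration)


# A phrase starting 1.23 s into the audio and lasting 2.5 s.
print(phrase_span(12_300_000, 25_000_000))
```
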
Thanks Ali. Are you suggesting using the Speaker Recognition API first, then segmenting the audio so there is one stream per speaker, and then feeding these streams to the Speech-to-text API? Or is there a more streamlined way to do all this? – Blue482 Jun 11 '18 at 22:55
You can feed the streams to the Speech to Text API, then chunk the audio according to the returned `Offset` and `Duration` of each phrase, then send those chunks to the Speaker Recognition API to identify the speaker by name. That way you'd have a name for each chunk to put with its transcribed phrase, and you can create a dialog out of them. – Ali Heikal Jun 11 '18 at 23:25
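
The chunking step could look roughly like this in Python, using only the standard-library `wave` module. It assumes the input is an uncompressed PCM WAV file and that `Offset` and `Duration` are in 100-nanosecond units; `cut_phrase` is a hypothetical helper, and the call to the Speaker Recognition API is left out.

```python
import wave

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def cut_phrase(src_path, dst_path, offset_ticks, duration_ticks):
    """Copy one phrase out of a WAV file; return the number of frames written."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Convert ticks to frame positions at the file's sample rate.
        start = offset_ticks * params.framerate // TICKS_PER_SECOND
        count = duration_ticks * params.framerate // TICKS_PER_SECOND
        src.setpos(min(start, params.nframes))
        frames = src.readframes(count)  # returns fewer frames near the end

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width, and rate; the frame count in the
        # header is corrected by the wave module when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return len(frames) // (params.sampwidth * params.nchannels)
```

Each output file then holds exactly one recognized phrase and can be paired with its transcribed text.
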
Thanks Ali. I think the prerequisite is that I have to enrol all the speakers first, before using the Speaker Recognition API, right? Another question: do you know if Microsoft has an offline speech-to-text model I can use? – Blue482 Jun 12 '18 at 09:58
Yes, and you can use the Speech Recognizer Class for offline. – Ali Heikal Jun 12 '18 at 12:30
Without using a subscription key? Do you know where I can get the class, Ali? – Blue482 Jun 12 '18 at 15:12
Yes, it is offline, just search for Windows.Media.SpeechRecognition – Ali Heikal Jun 12 '18 at 15:25
Thanks Ali!! Do you know what the difference is between this offline one and their Speech API? – Blue482 Jun 12 '18 at 15:29
The API is obviously way more efficient and constantly improved. – Ali Heikal Jun 12 '18 at 15:31
