YouTube's auto captioning produces better results than Google Speech to Text API (Model: video, UseEnhanced: true). How is this possible?

Question

Here are my settings for the Google Speech to Text API

Here is the output file of the Speech to Text API: https://justpaste.it/speechtotext2

Here is the output file of YouTube's auto caption: https://justpaste.it/ytautotranslate

This is the video link : https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses

This is the audio file of the video provided to Google Speech AI : https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac

Here are the SRT files with timestamps

YouTube's SRT : https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing

Google Speech to Text API's SRT (timing assigned by YouTube) : https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing

I compared some sentences, and YouTube's auto captioning is definitely better.

For example:

Google Speech to Text : Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.

What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.

YouTube's auto captioning : represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons

what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it

I checked many cases, and YouTube is much better at guessing the correct words. How is this even possible?
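A spot check like this can be made repeatable with a word-level diff of the two transcripts. The following is only a sketch using Python's standard `difflib`; the function name `word_diff` and the normalization (lowercasing, dropping punctuation) are assumptions made for illustration and are not part of either service.

```python
import difflib
import re


def word_diff(first, second):
    """Return the word-level differences between two transcripts.

    Each item is (tag, first_words, second_words), where tag is
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    # Normalize: lowercase and keep only letters, digits, and apostrophes,
    # so punctuation and capitalization do not count as differences.
    a = re.findall(r"[a-z0-9']+", first.lower())
    b = re.findall(r"[a-z0-9']+", second.lower())

    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


speech_api = "Represent the doctor representation is one of the hardest part"
youtube = "represent the data representation is one of the hardest part"
for tag, api_words, yt_words in word_diff(speech_api, youtube):
    print(f"{tag}: API={api_words!r} YouTube={yt_words!r}")
```

A diff like this only shows where the two outputs disagree. Deciding which one is right still requires listening to the video or comparing against a manual transcript.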

This is the command I used to extract the audio from the video : ffmpeg -i "input.mkv" -af aformat=s16:48000 output.flac

*How can be this possible?* This is just a guess, but even though Google owns YouTube, YouTube is a different organization than Google, with different researchers, and with completely different motivations for converting speech to text. — Gilbert Le Blanc, Oct 12 '20 at 18:50
@GilbertLeBlanc I totally get it. But the whole aim of the Google Speech to Text API is to provide the best results possible, because it is a premium service, not a free one. That is what is shocking to me. — Furkan Gözükara, Oct 12 '20 at 19:51
All I can say is send Google an email with your research. I suspect that the reasons behind what you found show us that YouTube's political motivations are much stronger than Google's profit motive. — Gilbert Le Blanc, Oct 12 '20 at 22:58

score 2 · Answer 1 · answered Oct 13 '20 at 22:54

Both YouTube's automatic captions and the Speech to Text API's transcriptions are generated by machine learning algorithms, so the quality of the transcription can vary depending on several factors.

It is important to note that the Speech to Text API uses machine learning models that are improved over time, so the results can vary with the input file and the request configuration. One way to help improve Google's transcription models is to enable data logging. This allows Google to collect data from your audio transcription requests and use it to improve the machine learning models used for recognizing speech audio, including the enhanced models.

Additionally, in the request configuration of the Speech to Text API, you can specify the RecognitionConfig settings. This object contains fields such as encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and speechContexts. Each of these parameters plays an important role in the accuracy of the transcription.
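As a rough illustration of how these fields fit together, the sketch below builds the JSON body of a REST recognition request using only the Python standard library. The helper name `build_request`, the `en-US` language code, and the `gs://` URI (derived from the storage link in the question) are assumptions. `model` and `useEnhanced` mirror the settings named in the question title, and 48000 Hz matches the ffmpeg command in the question.

```python
import json


def build_request(gcs_uri, language_code="en-US", sample_rate_hz=48000):
    """Build a Speech to Text request body (hypothetical helper)."""
    config = {
        # For FLAC input, encoding and sample rate can also be omitted,
        # because the service can read them from the file header.
        "encoding": "FLAC",
        "sampleRateHertz": sample_rate_hz,
        "languageCode": language_code,  # assumed; use the audio's language
        "maxAlternatives": 1,
        "profanityFilter": False,
        "model": "video",
        "useEnhanced": True,
    }
    return {"config": config, "audio": {"uri": gcs_uri}}


# Assumed bucket path, derived from the public storage URL in the question.
body = build_request("gs://text_speech_furkan/machine_learning_lecture_1.flac")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent in an authenticated POST request. Audio longer than about a minute needs the asynchronous `speech:longrunningrecognize` method rather than the synchronous `speech:recognize` call.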

Specifically for FLAC audio files, lossless compression preserves the quality of the audio provided, since there is no degradation of the original digital samples. FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size); the level affects only encoding speed and file size, not audio quality.

Also, the Speech to Text API offers different ways to improve the accuracy of the transcription, such as:

Speech adaptation : This feature allows you to specify words and/or phrases that STT should recognize more frequently in your audio data.
Speech adaptation boost : This feature allows you to add numerical weights to words and/or phrases according to how frequently they should be recognized in your audio data.
Phrase hints : Send a list of words and phrases that provide hints to the speech recognition task.

These features might help improve the accuracy of the Speech to Text API when it recognizes your audio files.
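In the REST API, phrase hints and boost are both expressed through `speechContexts` in the RecognitionConfig. The following is a minimal sketch, assuming hypothetical phrases taken from the misrecognized words in the question and an arbitrary boost value. The helper name `with_speech_context` is made up for illustration, and whether `boost` is accepted depends on the API version you call (it was first offered in `v1p1beta1`).

```python
import json


def with_speech_context(config, phrases, boost=None):
    """Return a copy of a RecognitionConfig dict with one speech context added."""
    context = {"phrases": list(phrases)}
    if boost is not None:
        # Higher values make the listed phrases more likely to be chosen.
        context["boost"] = boost
    updated = dict(config)
    updated["speechContexts"] = list(config.get("speechContexts", [])) + [context]
    return updated


base_config = {"languageCode": "en-US", "model": "video", "useEnhanced": True}
# Hypothetical hints: terms the API got wrong in the examples from the question.
adapted = with_speech_context(
    base_config, ["data representation", "input"], boost=10.0
)
print(json.dumps(adapted, indent=2))
```

Hints work best when they are specific to the recording, such as domain terms, names, and jargon the speaker actually uses. A suitable boost value has to be found by experiment.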

Finally, please refer to the Speech to Text best practices to improve the transcription of your audio files. These recommendations are designed for greater efficiency and accuracy, as well as reasonable response times from the API.
