Google Cloud Speech-to-Text (MP3 to text)

Question

I am using the Google Cloud Platform Speech-to-Text API on a trial account, and I cannot get text from an audio file. I do not know which encoding and sampleRateHertz to use for an MP3 file with a bit rate of 128 kbps. I have tried various options, but I get no transcription.

const speech = require('@google-cloud/speech');

const config = {
  encoding: 'LINEAR16',  //AMR, AMR_WB, LINEAR16(for wav)
  sampleRateHertz: 16000,  //16000 giving blank result.
  languageCode: 'en-US'
};

Grokify · Answer 1 · 2019-09-16T11:30:30.160

MP3 is now supported in beta:

MP3 Only available as beta. See RecognitionConfig reference for details.

https://cloud.google.com/speech-to-text/docs/encoding

MP3 MP3 audio. Support all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, sampleRateHertz can be optionally unset if not known.

https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#AudioEncoding
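
As a rough illustration of the quoted reference, the JSON body for the v1p1beta1 speech:recognize REST method can be assembled with encoding set to MP3 and sampleRateHertz left out when it is not known. The helper name build_recognize_body is made up for this sketch, and sending the request (authentication, HTTP client) is not shown.

```python
import base64
import json


def build_recognize_body(mp3_bytes, language_code="en-US", sample_rate_hertz=None):
    """Build a JSON request body for speech:recognize (hypothetical helper)."""
    config = {"encoding": "MP3", "languageCode": language_code}
    # sampleRateHertz is only included when the caller actually knows it.
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    # The REST API expects the audio bytes as a base64 string.
    audio = {"content": base64.b64encode(mp3_bytes).decode("ascii")}
    return json.dumps({"config": config, "audio": audio})
```

The resulting string would be posted to the v1p1beta1 endpoint rather than v1, since the MP3 encoding is described above as beta only.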

You can find out the sample rate using a variety of tools such as iTunes. CD-quality audio uses a sample rate of 44100 Hertz. Read more here:

https://en.wikipedia.org/wiki/44,100_Hz
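
If no audio tool is at hand, the sample rate can also be read from the first MP3 frame header. The following is a minimal sketch, not a full MP3 parser: the function name mp3_sample_rate is my own, it looks only at the first frame it finds, and it skips a leading ID3v2 tag but ignores the optional ID3 footer.

```python
# Sample rates in Hz, keyed by the 2-bit MPEG version field of the frame
# header and indexed by the 2-bit sampling-rate field.
SAMPLE_RATES = {
    0b11: (44100, 48000, 32000),  # MPEG-1
    0b10: (22050, 24000, 16000),  # MPEG-2
    0b00: (11025, 12000, 8000),   # MPEG-2.5
}


def mp3_sample_rate(data):
    """Return the sample rate (Hz) of the first MPEG audio frame header in data."""
    pos = 0
    # Skip a leading ID3v2 tag: 10-byte header plus a 28-bit "syncsafe" size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        pos = 10 + size
    while pos + 3 <= len(data):
        b0, b1, b2 = data[pos], data[pos + 1], data[pos + 2]
        # A frame header starts with 11 set bits (the frame sync).
        if b0 == 0xFF and (b1 & 0xE0) == 0xE0:
            version = (b1 >> 3) & 0b11
            layer = (b1 >> 1) & 0b11
            bitrate_index = b2 >> 4
            rate_index = (b2 >> 2) & 0b11
            # Reserved field values indicate a false sync, so keep scanning.
            if (version in SAMPLE_RATES and layer != 0
                    and bitrate_index != 15 and rate_index != 3):
                return SAMPLE_RATES[version][rate_index]
        pos += 1
    raise ValueError("no MPEG audio frame header found")
```

Reading the file in binary mode and passing its bytes to mp3_sample_rate gives a value that can be used for sampleRateHertz.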

To use this in a Google SDK, you may need to use one of the beta SDKs that defines this. Here is the constant from the Go Beta SDK:

RecognitionConfig_MP3 RecognitionConfig_AudioEncoding = 8

https://godoc.org/google.golang.org/genproto/googleapis/cloud/speech/v1p1beta1

I used the beta version for an MP3 file whose sample rate is 44100 Hz (found using sox), but with that value the API transcribes only the first word, whereas with a sample rate of 8000 it transcribes properly. There is no such issue with the Azure speech-to-text API. – Nitin, Jun 03 '20 at 05:59

score 3 · Answer 2 · answered Nov 28 '18 at 13:42

According to the official documentation (https://cloud.google.com/speech-to-text/docs/encoding),

Only the following formats are supported:

FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE

Anything else will be rejected.

Your best bet is to convert the MP3 file to either:

FLAC. See ".NET: How can I convert an mp3 or a wav file to .flac".
WAV, using LINEAR16 in that case. You can use NAudio; see "Converting mp3 data to wav data C#".
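
Once the file has been converted to WAV, the values for the config can be read from the WAV header instead of guessed. A small sketch using Python's standard wave module follows; the helper name linear16_config_from_wav is hypothetical, and the returned dict only mirrors the field names used in the question.

```python
import wave


def linear16_config_from_wav(wav_file, language_code="en-US"):
    """Read a WAV header and return matching LINEAR16 settings (hypothetical helper)."""
    # wave.open accepts a path or a binary file-like object.
    with wave.open(wav_file, "rb") as wav:
        # LINEAR16 means 16-bit samples, i.e. 2 bytes per sample.
        if wav.getsampwidth() != 2:
            raise ValueError("LINEAR16 expects 16-bit samples")
        return {
            "encoding": "LINEAR16",
            "sampleRateHertz": wav.getframerate(),
            "languageCode": language_code,
        }
```

This way sampleRateHertz always matches what the converter actually wrote.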

Honestly, it is annoying that Google does not support MP3 out of the box, unlike Amazon, IBM and Microsoft. It forces us to jump through hoops and increases bandwidth usage, since FLAC and LINEAR16 are lossless and therefore much bigger to transmit.

Pic Mickael

What format should I use to get text from an m4a file? I am using the config below, but it fails and returns an empty result: {@"encoding":@"MULAW", @"sampleRateHertz":@(16000), @"languageCode":@"en-IN", @"maxAlternatives":@30} – CrazyPro007 May 01 '19 at 07:36
The URL is NSString *service = @"https://speech.googleapis.com/v1/speech:recognize"; – CrazyPro007 May 01 '19 at 08:41

score 2 · Answer 3 · answered Oct 17 '18 at 14:43

I had the same issue and resolved it by converting it to FLAC.

Try converting your audio to FLAC and use

encoding: 'FLAC',

For conversion, you can use sox ref: https://www.npmjs.com/package/sox
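
For reference, the conversion can also be driven from a script by calling the sox command-line tool directly. This is a sketch under assumptions: it uses the sox CLI rather than the npm package linked above, the 16000 Hz mono target is my choice and not something the API requires, and the local sox build must be able to read MP3.

```python
import subprocess


def sox_flac_command(src, dst, sample_rate=16000):
    """Return a sox command line that writes a mono 16-bit FLAC file."""
    # -r sets the output sample rate, -c the channel count, -b the bit depth.
    return ["sox", src, "-r", str(sample_rate), "-c", "1", "-b", "16", dst]


def convert_to_flac(src, dst, sample_rate=16000):
    """Run sox; raises CalledProcessError if sox exits with an error."""
    subprocess.run(sox_flac_command(src, dst, sample_rate), check=True)
```

Whatever sample rate is chosen here should also be the value passed as sampleRateHertz.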

Rejo Chandran

bob tian · Answer 4 · 2022-05-28T02:50:56.900

Currently, the MP3 encoding for Speech-to-Text is only available in the speech_v1p1beta1 module. Send your request through this module with the MP3 encoding and you will get the transcription. A Python example:

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()
speech_file = "your mp3 file path"
# Read the raw MP3 bytes; no manual base64 encoding is needed with the client library.
with open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MP3,
    sample_rate_hertz=44100,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
print(response)
for result in response.results:
    # The first alternative is the most likely one for this portion.
    print(u"Transcript: {}".format(result.alternatives[0].transcript))
