24

It's possible to use Google's Speech Recognition API to get a transcription of an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API gives me this:

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word was said?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

i.e. the word "one" was said between 0.23 s and 0.80 s,
and the word "two" was said between 1.03 s and 1.45 s.

PS: I'm looking for an API that supports languages other than English, especially French.

Jeankowkow
Basj

3 Answers

15

I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets

deweydb
13

EDIT 2020: Now possible, see the other answers

It is not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

Vosk-API - free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics SaaS speech recognition API

Speech Recognition API from IBM
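
As a rough illustration of what word timestamps from such an API look like once parsed, here is a minimal sketch that turns a Vosk-style JSON result into the `[word, start, end]` form asked for in the question. The JSON layout (a `result` list whose entries carry `word`, `start` and `end`) is my assumption about Vosk's output when word-level output is enabled, and the sample string is made up from the numbers in the question; `vosk_word_timings` is a hypothetical helper, not part of any library.

```python
import json


def vosk_word_timings(result_json):
    """Return [[word, start_sec, end_sec], ...] from a Vosk-style JSON result.

    Assumes (not confirmed by the thread) that the JSON has a "result" list
    whose entries contain "word", "start" and "end" keys.
    """
    parsed = json.loads(result_json)
    # A result with no recognized words may lack the "result" key entirely.
    return [[w["word"], w["start"], w["end"]] for w in parsed.get("result", [])]


# Made-up sample payload, using the timings from the question.
sample = (
    '{"result": ['
    '{"conf": 1.0, "start": 0.23, "end": 0.80, "word": "one"},'
    '{"conf": 1.0, "start": 1.03, "end": 1.45, "word": "two"}'
    '], "text": "one two"}'
)

vosk_timings = vosk_word_timings(sample)
print(vosk_timings)
```

The same reshaping idea should carry over to the other services, although their field names will differ.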

Martijn Pieters
Nikolay Shmyrev
  • Thanks! Have you tried these 3 APIs? Are they as good as Google's? I am amazed every day by how powerful Google's speech recognition is. (I dictate my text messages aloud to my Android phone, and it makes almost no mistakes!) – Basj Dec 04 '15 at 12:44
  • They should be comparable in terms of accuracy. – Nikolay Shmyrev Dec 04 '15 at 13:51
  • Sadly, it seems that none of them supports French. – Basj Jan 30 '16 at 16:26
  • 6
    We tried IBM BlueMix Speech API for exactly this purpose and found the accuracy to be abysmal. Even simple clearly-spoken isolated words like "spoon" would come back as "moon", "room", "doom", "bloom", "whom". And this was after I pre-specified the keyword set to ("spoon") with a low acceptance probability. As the OP mentioned IBM does provide start and stop times for each word (which Google apparently does not), however the accuracy was too low to be usable. – Hephaestus Feb 11 '17 at 06:03
  • @Hephaestus, which vendor did you find provides the highest accuracy? Google? – Andy Sep 14 '22 at 20:52
9

Yes, it is very much possible. All you need to do is:

In the config set enable_word_time_offsets=True

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)

Then, for each word in the alternative, you can print its start time and end time as in this code:

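# `response` is the object returned by the recognition request.
for result in response.results:
    # The first alternative is the most likely transcription.
    alternative = result.alternatives[0]
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        # Offsets are split into whole seconds and nanoseconds.
        # (Newer client releases may return datetime.timedelta objects;
        # in that case use start_time.total_seconds() instead.)
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))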

This would give you output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7
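
If you want the `['one', 0.23, 0.80]` shape from the question rather than printed lines, the same fields can be collected into a list. This is only a sketch: `to_seconds` and `word_timings` are hypothetical helper names, and the `SimpleNamespace` objects below are stand-ins for a real API response so the snippet runs without credentials. It assumes each word carries `word`, `start_time` and `end_time`, as in the loop above.

```python
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a time offset to float seconds.

    Accepts either an object with .seconds/.nanos fields or a
    datetime.timedelta-like object exposing total_seconds().
    """
    if hasattr(offset, "total_seconds"):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def word_timings(alternative):
    """Return [[word, start_sec, end_sec], ...] for one alternative."""
    return [
        [w.word, to_seconds(w.start_time), to_seconds(w.end_time)]
        for w in alternative.words
    ]


def _offset(seconds, nanos):
    # Stand-in for the API's duration object (assumption, for the demo only).
    return SimpleNamespace(seconds=seconds, nanos=nanos)


# Fake alternative built from the timings given in the question.
fake_alternative = SimpleNamespace(words=[
    SimpleNamespace(word="one", start_time=_offset(0, 230_000_000),
                    end_time=_offset(0, 800_000_000)),
    SimpleNamespace(word="two", start_time=_offset(1, 30_000_000),
                    end_time=_offset(1, 450_000_000)),
])

timings = word_timings(fake_alternative)
print(timings)
```

With a real response you would call `word_timings(result.alternatives[0])` for each result.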

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets

Ishmeet Kaur