Using the Google Speech API or Google Cloud Speech, is it possible to:

  1. Parse an audio file and locate the exact time(s) within the file at which a specific word is spoken?

  2. Add new words (not recognized in existing languages) to the dictionary, so that it is possible to search for these words in the file?

If not, are there other technologies to consider?

Thanks

user1052610

1 Answer

1st Question

Yes, you can transcribe an audio file and retrieve each word of the transcription together with its timestamps.

To get the timestamps for each word, set enable_word_time_offsets to True in the recognition config. According to the docs:

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Speech-to-Text supports time offsets for all speech

Below is a Python example that collects every word along with the timestamps of all its occurrences.

from google.cloud import speech

client = speech.SpeechClient()

# Placeholder URI: replace with your own audio file in Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://YOUR_BUCKET/YOUR_AUDIO.wav")

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    # Enable per-word time offsets.
    enable_word_time_offsets=True,
)
response = client.recognize(config=config, audio=audio)

# Map each word to a list of [start, end] offsets in seconds, one per occurrence,
# e.g. {"word": [[0.0, 0.2], [0.2, 0.4], ...]}
words_dict = {}
for result in response.results:
    # Use the most likely transcription of this portion of the audio.
    alternative = result.alternatives[0]

    for word_info in alternative.words:
        start_time = word_info.start_time.total_seconds()
        end_time = word_info.end_time.total_seconds()
        # Record this occurrence under its word.
        words_dict.setdefault(word_info.word, []).append([start_time, end_time])

search_word = "YOUR SEARCH WORD"
# .get avoids a KeyError when the word does not appear in the transcript.
print(words_dict.get(search_word, []))

I used a dictionary keyed by word; each value is a list containing the start_time and end_time of every occurrence of that word.
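An exact dictionary lookup can miss occurrences when the transcript capitalizes a word or attaches punctuation to it. Here is a minimal sketch of a normalized lookup, assuming the word and timestamp values have already been pulled out of the response. The `normalize`, `build_index` and `find_word` helpers and the sample data are hypothetical and not part of the Speech-to-Text client library.

```python
import string
from collections import defaultdict


def normalize(word):
    """Lower-case a word and strip surrounding punctuation."""
    return word.strip(string.punctuation).lower()


def build_index(words):
    """Map each normalized word to a list of (start, end) offsets in seconds.

    `words` is an iterable of (word, start_seconds, end_seconds) tuples,
    for example built from word_info.word, start_time and end_time.
    """
    index = defaultdict(list)
    for word, start, end in words:
        index[normalize(word)].append((start, end))
    return index


def find_word(index, search_word):
    """Return all (start, end) offsets for search_word, or [] if absent."""
    return index.get(normalize(search_word), [])


# Made-up sample data standing in for a real transcript.
sample = [("Hello", 0.0, 0.4), ("world,", 0.4, 0.9), ("hello", 1.2, 1.6)]
index = build_index(sample)
print(find_word(index, "HELLO"))
```

With the sample data, both "Hello" and "hello" are found under the same key, and a word that never occurs returns an empty list instead of raising a KeyError.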

For complete walkthroughs, see the Speech-to-Text documentation on transcribing short audio files and on transcribing long audio files.

2nd Question

To add new words or phrases, use speech adaptation in your API requests. This feature lets you supply additional context with a recognition request, in the form of phrases or classes that help the recognizer. According to the docs, one use case is:

Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using speech adaptation.
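As a rough illustration of what such a request might look like, the sketch below builds the JSON body for the v1 `speech:recognize` REST method with a `speechContexts` entry. The `build_recognize_request` helper, the bucket URI, the phrases and the boost value are all assumptions made for the example. In the Python client, the corresponding setting should be the `speech_contexts` field of `RecognitionConfig`, populated with `speech.SpeechContext(phrases=[...])`.

```python
import json


def build_recognize_request(audio_uri, phrases, boost=None):
    """Build a JSON-serializable body for a recognize request.

    Hypothetical helper: `phrases` are the rare or domain-specific words
    to hint to the recognizer; `boost` is optional.
    """
    context = {"phrases": list(phrases)}
    if boost is not None:
        context["boost"] = boost
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 8000,
            "languageCode": "en-US",
            "enableWordTimeOffsets": True,
            "speechContexts": [context],
        },
        "audio": {"uri": audio_uri},
    }


# Placeholder URI and made-up phrases; replace with your own.
body = build_recognize_request(
    "gs://YOUR_BUCKET/YOUR_AUDIO.wav", ["Zyxel", "kubectl"], boost=10.0
)
print(json.dumps(body, indent=2))
```

Phrase hints bias recognition toward the listed words rather than guaranteeing them, so it is worth checking the resulting transcript before relying on a word search.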

See also this Stack Overflow post, which gives a good brief explanation.

Sayan Bhattacharya