1st Question
Yes, you can transcribe an audio file and get each word of the transcription together with its timestamps.
To get the timestamps for each word, set enable_word_time_offsets to True in the recognition config. According to the doc:
Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Speech-to-Text supports time offsets for all speech
Below is a Python example that collects every word together with the timestamps of all its occurrences.
from google.cloud import speech

client = speech.SpeechClient()
# `audio` is assumed to be a speech.RecognitionAudio built from your file or GCS URI.

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    # Enable per-word time offsets.
    enable_word_time_offsets=True,
)
response = client.recognize(config=config, audio=audio)

# Store each word in a dictionary; the timestamps of every occurrence are kept as a list.
# e.g. {"word": [[0.0, 0.2], [0.2, 0.4], ...]}
words_dict = {}
for result in response.results:
    alternative = result.alternatives[0]
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time.total_seconds()
        end_time = word_info.end_time.total_seconds()
        # Record this occurrence of the word with its start and end time.
        words_dict.setdefault(word, []).append([start_time, end_time])

search_word = 'YOUR SEARCH WORD'
# .get avoids a KeyError when the word never appears in the transcript.
print(words_dict.get(search_word, []))
I have used a dictionary to store each word. For each word there is a list which contains the start_time and end_time of each occurrence.
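If you want the lookup to ignore case and trailing punctuation, the dictionary approach can be wrapped in a small helper. This is only a sketch: build_word_index and find_word are hypothetical names, and the sample entries are stand-ins shaped like the API's word entries (a word plus start_time and end_time as timedelta values), so it runs without calling the service.

```python
from collections import defaultdict
from datetime import timedelta
from types import SimpleNamespace


def build_word_index(words):
    """Map each lower-cased word to a list of [start, end] offsets in seconds."""
    index = defaultdict(list)
    for info in words:
        # Normalise case and strip surrounding punctuation so "Hello," matches "hello".
        key = info.word.lower().strip(".,!?;:\"'")
        index[key].append(
            [info.start_time.total_seconds(), info.end_time.total_seconds()]
        )
    return dict(index)


def find_word(index, word):
    """Return every [start, end] pair for a word, or an empty list if it is absent."""
    return index.get(word.lower(), [])


# Stand-in data (hypothetical values), not real API output.
sample_words = [
    SimpleNamespace(word="Hello", start_time=timedelta(seconds=0.0), end_time=timedelta(seconds=0.5)),
    SimpleNamespace(word="world", start_time=timedelta(seconds=0.5), end_time=timedelta(seconds=1.5)),
    SimpleNamespace(word="hello,", start_time=timedelta(seconds=1.5), end_time=timedelta(seconds=2.0)),
]

index = build_word_index(sample_words)
print(find_word(index, "hello"))
print(find_word(index, "missing"))
```

With a real response you would pass alternative.words for each result instead of sample_words.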
You can follow the Speech-to-Text guide on transcribing short audio files, or the guide on transcribing long audio files.
2nd Question
To add new words or phrases, you need to use speech adaptation in your API requests. This feature lets you provide additional context with your recognition request, in the form of phrases or classes that help the recognition. According to the doc:
Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using speech adaptation.
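As a rough illustration of what a request with speech adaptation can look like, the sketch below builds the JSON body for the v1 speech:recognize REST method, with the extra phrases passed in speechContexts. The helper name build_recognize_body, the example phrases, and the bucket URI are placeholders I made up. In the Python client the equivalent setting is speech_contexts=[speech.SpeechContext(phrases=[...])] on the RecognitionConfig.

```python
import json


def build_recognize_body(phrases, gcs_uri):
    """Build a JSON body for the speech:recognize REST method with phrase hints."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 8000,
            "languageCode": "en-US",
            "enableWordTimeOffsets": True,
            # Speech adaptation: words or phrases the recognizer should favor.
            "speechContexts": [{"phrases": list(phrases)}],
        },
        "audio": {"uri": gcs_uri},
    }


# Placeholder phrases and URI; replace them with your own domain terms and audio file.
body = build_recognize_body(["Kubernetes", "gRPC"], "gs://your-bucket/your-audio.wav")
print(json.dumps(body, indent=2))
```

The body would be sent with a POST to https://speech.googleapis.com/v1/speech:recognize using your usual authentication.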
You can also go through this Stack Overflow post which has already given a good brief explanation.