I have an audio file that contains a single spoken word. I am certain the word is in the file, and I need to detect where it begins and ends.
Any ideas on how to do this in Python?
Here's what I've done so far. I tried the speech_recognition library:
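    import speech_recognition as sr

    r = sr.Recognizer()
    # AudioFile is the current name for the older WavFile class
    with sr.AudioFile("a.wav") as source:
        audio = r.record(source)  # read the entire file
    try:
        # key=None falls back to the library's default API key
        text = r.recognize_google(audio, key=None)
        print(text)
    except sr.UnknownValueError:
        # raised when the speech is unintelligible
        print("Could not understand audio")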
This transcribes the audio but does not provide timestamps for when the words occurred. I know I could chop the audio file into parts and keep feeding them through the Google speech recognizer until I find the part I want, but that seems like a terrible idea. I can also imagine cases where the transcription is slightly wrong, so the word I am certain is in the file may not appear in the transcript at all.
I tried pocketsphinx as well, but I could not work out how to get it to report the likely location of a word in the file (and it transcribed the test file terribly).
Ideally, I am looking for a function like find_likely_location_of_word(word) that returns a beginning timestamp and an ending timestamp.
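To make the kind of return value concrete, here is a rough baseline sketch that uses only the standard library. It is not word spotting: it simply thresholds short-time energy, so it reports where any sound occurs, which may be enough only if the file contains nothing but the one word. The names find_active_span and find_active_span_in_wav, the 20 ms frame size, and the 10% threshold are all my own assumptions, and the WAV helper assumes 16-bit mono PCM.

```python
import array
import math
import sys
import wave


def find_active_span(samples, sample_rate, frame_ms=20, threshold_ratio=0.1):
    """Return (start_sec, end_sec) covering every frame whose RMS energy is at
    least threshold_ratio times the loudest frame's RMS, or None if silent."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return None  # empty or completely silent input
    # Indices of frames loud enough to count as "active"
    active = [k for k, value in enumerate(rms) if value >= threshold_ratio * peak]
    start = active[0] * frame_len / sample_rate
    end = min((active[-1] + 1) * frame_len, len(samples)) / sample_rate
    return start, end


def find_active_span_in_wav(path):
    """Hypothetical helper: assumes a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
        return find_active_span(samples, wav.getframerate())
```

Calling find_active_span_in_wav("a.wav") would give a (start, end) pair in seconds. What I am really after is the same kind of result, but tied to a specific word and robust to background noise or other speech in the file.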
I assume this has been done many times before, so perhaps someone can at least point me in the right direction?