
I have an audio file that contains a spoken word. I am certain that it contains the word and I need to detect the beginning and end of the word.

Any ideas on how to do this using Python?

Here's what I've done so far. I tried the speech_recognition library in Python.

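import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("a.wav") as source:  # AudioFile replaces the older WavFile name
    audio = r.record(source)  # read the entire file into memory

try:
    transcript = r.recognize_google(audio, key=None)  # key=None uses the default API key
    print(transcript)
except sr.UnknownValueError:  # raised when the speech is unintelligible
    print("Could not understand audio")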

This transcribes the audio but does not provide timestamps for when the words occurred. I know I could chop my audio file into parts and keep feeding them through the Google speech recognizer until I isolate the part I want, but this seems like a terrible idea. I'm also envisioning cases where the transcription is not quite accurate, so the word that I am certain is in the file may not be transcribed correctly.

I tried pocketsphinx as well, but I was unsure how to get it to provide the likely location of a word in the file (it also transcribed the test file terribly).

Ideally, I would be searching for a function: find_likely_location_of_word(word) that returns a beginning timestamp and an ending timestamp.
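
A minimal sketch of that interface, assuming some recognizer has already produced word-level timings. The `segments` argument (a list of `(text, start_seconds, end_seconds)` tuples) and the `min_score` cutoff are hypothetical and not part of any library; the fuzzy comparison uses the standard library's `difflib` so that a slightly wrong transcription can still match.

```python
from difflib import SequenceMatcher


def find_likely_location_of_word(word, segments, min_score=0.6):
    """Return (start, end) in seconds of the segment most similar to `word`.

    `segments` is an assumed format: an iterable of
    (text, start_seconds, end_seconds) tuples from some recognizer.
    Returns None when no segment reaches `min_score`.
    """
    target = word.lower()
    best_score, best_span = 0.0, None
    for text, start, end in segments:
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, target, text.lower()).ratio()
        if score > best_score:
            best_score, best_span = score, (start, end)
    return best_span if best_score >= min_score else None


# Made-up timings for illustration; "weather" was mis-transcribed as "whether".
example_segments = [("the", 0.0, 0.2), ("whether", 0.2, 0.7), ("today", 0.7, 1.1)]
print(find_likely_location_of_word("weather", example_segments))
```

The open part is still where those word-level timings come from.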

I had thought this had to be something that had been done many times, so perhaps someone can at least point me in the right direction?

RaTh0D
Anthony Tyler
  • Possible duplicate of [Keyword Spotting in Speech](https://stackoverflow.com/questions/5184233/keyword-spotting-in-speech) – Nikolay Shmyrev Jun 18 '17 at 09:18
  • There are also APIs which return timestamps, for example IBM Watson. It is about speed/accuracy balance. Transcription is slow but more accurate, spotting is faster and more robust, but prone to false alarms. – Nikolay Shmyrev Jun 18 '17 at 09:30

1 Answer

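Let the word you are trying to find be stored in a variable called "var":

import speech_recognition as sr

var = "word"  # placeholder: the word you are trying to find
r = sr.Recognizer()
with sr.AudioFile("a.wav") as source:  # AudioFile replaces the older WavFile name
    audio = r.record(source)  # read the entire file into memory

try:
    transcript = r.recognize_google(audio, key=None)
    if var in transcript:  # substring check against the transcribed text
        print("word found")
except sr.UnknownValueError:  # raised when the speech is unintelligible
    print("Could not understand audio")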
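
The check above only confirms that the word appears in the transcript; it does not say where. If the file holds nothing but that one word surrounded by silence (an assumption, since the question does not say so), a rough alternative is energy-based endpoint detection, which is a different technique from speech recognition. The sketch below is only an illustration: `find_speech_bounds`, `frame_ms`, and `threshold_ratio` are hypothetical names and values, and it expects the samples to be already decoded into a list of numbers.

```python
import math


def find_speech_bounds(samples, sample_rate, frame_ms=20, threshold_ratio=0.1):
    """Return (start_sec, end_sec) of the loud region, or None if all is silent.

    Assumes a single spoken word surrounded by near-silence; background
    noise or several words would need a more careful method.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    # Root-mean-square energy of each non-overlapping frame.
    energies = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(energies, default=0.0)
    if peak == 0.0:
        return None
    # Frames at or above a fraction of the peak energy count as speech.
    threshold = peak * threshold_ratio
    active = [idx for idx, e in enumerate(energies) if e >= threshold]
    start = active[0] * frame_len / sample_rate
    end = min((active[-1] + 1) * frame_len, len(samples)) / sample_rate
    return start, end
```

For a 16-bit mono WAV file, the samples could be loaded with the standard library's `wave` module and `array.array("h", ...)` before calling the function. The returned times are only as precise as the frame length.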