I have a .wav file in which I recorded my own voice, speaking for several minutes. Let's say I want to find the exact times at which I said "Mike" in the audio. I looked into speech recognition and ran some tests with the Google Speech API, but the timestamps I got back were far from accurate.
As an alternative, I recorded a very short .wav file in which I just say "Mike". I am trying to compare these two .wav files and find every timestamp at which "Mike" is said in the longer .wav file. I came across SleuthEye's amazing answer.
This code works perfectly well for finding a single timestamp, but I couldn't figure out how to get multiple start/end times:
import numpy as np
import sys
from scipy.io import wavfile
from scipy import signal
snippet = sys.argv[1]
source = sys.argv[2]
# read the sample to look for
rate_snippet, snippet = wavfile.read(snippet)
snippet = np.array(snippet, dtype='float')
# read the source
rate, source = wavfile.read(source)
source = np.array(source, dtype='float')
# resample such that both signals are at the same sampling rate (if required)
if rate != rate_snippet:
    num = int(np.round(rate*len(snippet)/rate_snippet))
    snippet = signal.resample(snippet, num)
# compute the cross-correlation
z = signal.correlate(source, snippet)
peak = np.argmax(np.abs(z))
start = (peak-len(snippet)+1)/rate
end = peak/rate
print("start {} end {}".format(start, end))
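For illustration, one possible direction for getting several matches is to keep every strong peak of the cross-correlation rather than only the largest one. The sketch below is not from SleuthEye's answer. It assumes `scipy.signal.find_peaks`, and the function name `find_occurrences` and the `rel_height` threshold are hypothetical choices that would need tuning on real recordings.

```python
import numpy as np
from scipy import signal


def find_occurrences(source, snippet, rate, rel_height=0.6):
    """Return a list of (start, end) times in seconds, one per strong peak.

    rel_height is a hypothetical tuning knob (an assumption, not part of the
    original code): peaks weaker than this fraction of the strongest peak
    are ignored.
    """
    # Same cross-correlation as above, taken as a magnitude
    z = np.abs(signal.correlate(source, snippet))

    # Keep every local maximum that is strong enough, and require peaks to be
    # at least one snippet length apart so a single match is not counted twice
    peaks, _ = signal.find_peaks(
        z,
        height=rel_height * z.max(),
        distance=len(snippet),
    )

    # Convert each peak index to start/end times, as in the single-peak version
    return [((p - len(snippet) + 1) / rate, p / rate) for p in peaks]
```

With the arrays from the script above, this could be called as `find_occurrences(source, snippet, rate)`. It assumes mono (1-D) audio and that every occurrence correlates roughly as strongly as the best one, which may well not hold for natural speech.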