I am trying to achieve the following:
- Using Skype, call my mailbox (works)
- Enter password and tell the mailbox that I want to record a new welcome message (works)
- Now, my mailbox tells me to record the new welcome message after the beep
- I want to wait for the beep and then play the new message (doesn't work)
How I tried to achieve the last point:
- Create a spectrogram using FFT and sliding windows (works)
- Create a fingerprint for the beep
- Search for that fingerprint in the audio that comes from Skype
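To make the steps above concrete, here is a minimal sketch of the spectrogram and fingerprint stages in plain Python. It is not my actual code: the window size, hop size, the 1 kHz test tone, the 8 kHz sample rate and the "strongest bin per frame" fingerprint are all assumptions chosen for illustration, and the `fft`, `spectrogram` and `fingerprint` functions are hypothetical helpers.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(samples, window=256, hop=128):
    """Magnitude spectrogram: one list of window // 2 bin magnitudes per frame."""
    # A Hann window reduces spectral leakage between neighbouring bins.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (window - 1)) for i in range(window)]
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = [samples[start + i] * hann[i] for i in range(window)]
        spectrum = fft(chunk)
        # Keep only the non-redundant half of the spectrum of a real signal.
        frames.append([abs(c) for c in spectrum[: window // 2]])
    return frames


def fingerprint(frames):
    """Hypothetical fingerprint: index of the strongest bin in each frame."""
    return [max(range(len(f)), key=f.__getitem__) for f in frames]


# Synthetic stand-in for the beep: a 1 kHz tone sampled at 8 kHz (assumed values).
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(1024)]
beep_print = fingerprint(spectrogram(tone))
```

With these assumed values each frequency bin is `RATE / window` = 31.25 Hz wide, so the test tone falls exactly on one bin. A real beep will usually smear over several bins, which is part of why an exact comparison fails.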
The problem I am facing is the following:
The FFT results for the audio from Skype and for the reference beep are not bit-for-bit identical: they are similar, but not the same, even though the reference beep was extracted from a recording of the Skype audio. The following picture shows the spectrogram of the beep from the Skype audio on the left and the spectrogram of the reference beep on the right. As you can see, they are very similar, but not identical.
Spectrogram picture: http://img27.imageshack.us/img27/6717/spectrogram.png
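One direction I could imagine, instead of testing for equality, is to score how similar two spectrogram patches are and accept anything above a threshold. The sketch below uses cosine similarity for that. The `find_beep` function, the 0.9 threshold and the toy magnitudes are assumptions for illustration only, not measured values from my recording.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length magnitude vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_beep(stream_frames, ref_frames, threshold=0.9):
    """Return the first frame index where the stream resembles the reference, else -1."""
    ref_flat = [v for frame in ref_frames for v in frame]
    width = len(ref_frames)
    for start in range(len(stream_frames) - width + 1):
        # Flatten the candidate patch the same way as the reference.
        patch = [v for frame in stream_frames[start:start + width] for v in frame]
        if cosine_similarity(patch, ref_flat) >= threshold:
            return start
    return -1


# Toy spectrograms: each inner list is one frame of bin magnitudes.
reference = [[0.0, 9.0, 1.0], [0.0, 10.0, 1.0]]
stream = [
    [1.0, 0.5, 0.8],  # background noise
    [0.9, 0.7, 1.0],
    [0.2, 8.1, 1.4],  # beep: similar to the reference, but not identical
    [0.1, 9.6, 0.7],
    [1.0, 0.6, 0.9],
]
position = find_beep(stream, reference)
```

Because cosine similarity ignores overall loudness, a quieter or louder copy of the beep would score the same, but the threshold would still have to be tuned against real Skype audio.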
I don't know how to continue from here. Should I average it, i.e. divide the spectrogram into columns and rows and compare the averages of those cells, as described here? I am not sure this is the best way, because the author already states that it doesn't work very well with short audio samples, and the beep is less than a second long.
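This is roughly how I understand the averaging idea, as a minimal sketch. The grid size, the 0.2 tolerance, the helper names `cell_averages` and `grids_match`, and the toy data are all my own assumptions, so it may not match the linked description exactly.

```python
def cell_averages(frames, rows, cols):
    """Average a spectrogram into a rows x cols grid.

    frames is a list of frames, each a list of bin magnitudes. Rows split the
    frequency bins, columns split the frames (time). Assumes at least `rows`
    bins and at least `cols` frames, so that no cell is empty.
    """
    n_frames, n_bins = len(frames), len(frames[0])
    grid = []
    for r in range(rows):
        b0, b1 = r * n_bins // rows, (r + 1) * n_bins // rows
        row = []
        for c in range(cols):
            f0, f1 = c * n_frames // cols, (c + 1) * n_frames // cols
            values = [frames[f][b] for f in range(f0, f1) for b in range(b0, b1)]
            row.append(sum(values) / len(values))
        grid.append(row)
    return grid


def grids_match(a, b, tolerance=0.2):
    """True if every cell differs by at most `tolerance`, relative to the largest cell."""
    peak = max(max(max(row) for row in a), max(max(row) for row in b)) or 1.0
    return all(
        abs(x - y) / peak <= tolerance
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Toy data: 4 frames x 4 bins, with the energy in the two upper bins.
reference = [[0.0, 0.0, 9.0, 9.0] for _ in range(4)]
recorded = [
    [0.1, 0.3, 8.2, 7.9],
    [0.2, 0.1, 9.1, 8.4],
    [0.0, 0.2, 8.8, 9.0],
    [0.3, 0.1, 7.7, 8.1],
]
silence = [[0.1, 0.2, 0.1, 0.3] for _ in range(4)]

ref_grid = cell_averages(reference, 2, 2)
beep_found = grids_match(cell_averages(recorded, 2, 2), ref_grid)
silence_found = grids_match(cell_averages(silence, 2, 2), ref_grid)
```

Averaging hides the small per-bin differences, but with a beep of under a second there are only a handful of frames per cell, which is exactly the short-sample weakness I am worried about.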
Any hints on how to proceed?