
I am trying to achieve the following:

  • Using Skype, call my mailbox (works)
  • Enter password and tell the mailbox that I want to record a new welcome message (works)
  • Now, my mailbox tells me to record the new welcome message after the beep
  • I want to wait for the beep and then play the new message (doesn't work)

How I tried to achieve the last point:

  • Create a spectrogram using FFT and sliding windows (works)
  • Create a "finger print" for the beep
  • Search for that fingerprint in the audio that comes from skype
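The first step above (a sliding-window FFT spectrogram) could be sketched roughly like this, assuming NumPy; the window length, hop size, and the synthetic 1 kHz test beep are illustrative choices, not the actual parameters used:

```python
import numpy as np

def spectrogram(samples, win_len=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed sliding FFT."""
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frame = samples[start:start + win_len] * window
        # keep only the non-negative frequencies of the real FFT
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, win_len // 2 + 1)

# Example: one second of a synthetic 1 kHz beep sampled at 8 kHz
t = np.arange(8000) / 8000.0
beep = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(beep)
```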

The problem I am facing is the following:
The result of the FFTs on the audio from skype and the reference beep are not the same in a digital sense, i.e. they are similar, but not the same, although the beep was extracted from an audio file with a recording of the skype audio. The following picture shows the spectrogram of the beep from the Skype audio on the left side and the spectrogram of the reference beep on the right side. As you can see, they are very similar, but not the same...
(Spectrogram comparison: http://img27.imageshack.us/img27/6717/spectrogram.png)

I don't know how to continue from here. Should I average it, i.e. divide it into columns and rows and compare the averages of those cells as described here? I am not sure this is the best way, because he already states that it doesn't work very well with short audio samples, and the beep is less than a second in length...

Any hints on how to proceed?

Daniel Hilgarth

2 Answers


You should determine the peak frequency and duration, possibly requiring a minimum power for that frequency over that duration (RMS being the simplest measure).

This should be easy enough to measure. To make things even more clever (but probably completely unnecessary for this simple matching task), you could assert the non-existence of other peaks during the window of the beep.
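The idea above could look roughly like this, assuming NumPy; the beep frequency, RMS threshold, tolerance, and durations are illustrative assumptions:

```python
import numpy as np

def detect_beep(samples, fs, beep_hz, min_dur_s, win_len=512, hop=256,
                freq_tol_hz=50.0, rms_thresh=0.05):
    """Flag a run of frames whose dominant FFT bin sits at the beep
    frequency and whose RMS power clears a threshold."""
    window = np.hanning(win_len)
    run = 0
    needed = int(min_dur_s * fs / hop)  # frames the beep must span
    for start in range(0, len(samples) - win_len + 1, hop):
        frame = samples[start:start + win_len]
        rms = np.sqrt(np.mean(frame ** 2))
        spectrum = np.abs(np.fft.rfft(frame * window))
        peak_hz = np.argmax(spectrum) * fs / win_len
        if rms >= rms_thresh and abs(peak_hz - beep_hz) <= freq_tol_hz:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0
    return False

fs = 8000
t = np.arange(fs // 2) / fs                  # 0.5 s of audio
test_beep = 0.5 * np.sin(2 * np.pi * 1000 * t)
print(detect_beep(test_beep, fs, beep_hz=1000, min_dur_s=0.3))  # True
```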

Update

To compare a complete audio fragment, you'll want to use a convolution algorithm. I suggest using a ready-made library implementation instead of rolling your own.

The most common fast convolution algorithms use fast Fourier transform (FFT) algorithms via the circular convolution theorem. Specifically, the circular convolution of two finite-length sequences is found by taking an FFT of each sequence, multiplying pointwise, and then performing an inverse FFT. Convolutions of the type defined above are then efficiently implemented using that technique in conjunction with zero-extension and/or discarding portions of the output. Other fast convolution algorithms, such as the Schönhage–Strassen algorithm, use fast Fourier transforms in other rings.
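As a concrete illustration of the circular convolution theorem described above (a minimal sketch, assuming NumPy): zero-pad both sequences, multiply their FFTs pointwise, inverse-FFT the product, and discard the padding.

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution via the circular convolution theorem."""
    n = len(a) + len(b) - 1            # length of the linear convolution
    size = 1 << (n - 1).bit_length()   # next power of two for the FFT
    fa = np.fft.rfft(a, size)
    fb = np.fft.rfft(b, size)
    # pointwise product in the frequency domain, then back to time domain
    return np.fft.irfft(fa * fb, size)[:n]

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])
print(np.allclose(fft_convolve(a, b), np.convolve(a, b)))  # True
```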

Wikipedia lists http://freeverb3.sourceforge.net as an open source candidate

Edit: Added link to the API tutorial page: http://freeverb3.sourceforge.net/tutorial_lib.shtml

Additional resources:

http://en.wikipedia.org/wiki/Finite_impulse_response

http://dspguru.com/dsp/faqs/fir

Existing packages with relevant tools on debian:

brutefir - a software convolution engine
jconvolver - Convolution reverb Engine for JACK

libzita-convolver2 - C++ library implementing a real-time convolution matrix
teem-apps - Tools to process and visualize scientific data and images - command line tools
teem-doc - Tools to process and visualize scientific data and images - documentation
libteem1 - Tools to process and visualize scientific data and images - runtime

yorick-yeti - utility plugin for the Yorick language
sehe
  • Thanks for your answer. Matching the beep is only the first step. In a second step I want to ask my mailbox to replay the new message and compare it with the file I played when the mailbox was waiting for the new message. Will your answer still work for this? I don't really see how it can work when I only look for one peak frequency... – Daniel Hilgarth Apr 30 '11 at 18:36
  • Response updated inline. This [course guide](http://ptolemy.eecs.berkeley.edu/eecs20/labs/LabVIEW_Labs/Lab04/Lab04.pdf) will probably be a good backgrounder/getting started – sehe Apr 30 '11 at 19:21
  • Thanks for the update. Looks like I have to invest a little bit more time than I thought... ;-) It's hard for me to understand this stuff, because I have no background in DSP... – Daniel Hilgarth Apr 30 '11 at 19:34
  • @sehe: Can you please elaborate a little bit on the Convolution algorithm? Using google, I wasn't even able to learn, how it would help me with my problem... – Daniel Hilgarth Apr 30 '11 at 20:12
  • Mmmm it should be pretty clear from the animation on the wikipedia link. I'm no sound processing guy, so I'm not sure I should be teaching you stuff I hardly know about myself :) In simple terms, a convolution function is a response function that will give a peak at the point (in time) where two source signals have the most similarity in curve. Convolutions can be much smarter than that, detecting scaled copies and repeats in both axes etc. but you won't need that. You'd only want to look for a 'positive match' peak at one moment in time, and judge whether it exceeds your required threshold :) – sehe Apr 30 '11 at 20:16
  • @sehe: I see. I was a little bit confused, because the animation was only for 2D data, but I have 3D (time, frequency, amplitude). I guess you are suggesting that I convolve the sounds in the time domain? If I understand it correctly, I will have a "lot" of "very high" peaks where the two samples match. How exactly does this help me? – Daniel Hilgarth Apr 30 '11 at 20:24
  • You have time and amplitude only. Frequency is implicit: it is the time-domain representation of the same data. Only for visualization is it necessary/convenient to join the three units in a chart. -- No, you won't have a lot of high peaks. If you look for a sample, you'll get exactly ONE peak where it matches (and the rest will be sloping up and down to and from that point). If there are similar points in the audio, they might get sub-peaks, but they do not concern you. **By the way have updated with more existing libraries/apps** – sehe Apr 30 '11 at 20:33
  • @sehe: Thanks for your time and effort. However, I fear it didn't really help me, because I am missing so many basics. For instance, I don't understand how the frequency is implicit when I only have the amplitude at a given point in time. Besides, when using a sliding window for the FFT, I have an array of doubles for every point in time, where each value in the array represents the magnitude of that frequency at that specific point in time... – Daniel Hilgarth Apr 30 '11 at 20:53
  • I think your biggest problem is getting past your FFT pre-occupation. FFT is central to this, but _just_ FFT won't cut it, as you have found out before coming here. _You can do FIR filtering using sliding windows. Strategies are detailed in the page about brutefir (follow link). FFT does a local translation to the frequency domain. This is all internal to the convolution algorithm, so you don't need to worry about it (except configuring the overlaps, window size, windowing function if you like). I'll also add a link to the freeverb3 tutorial page that might be helpful._ – sehe Apr 30 '11 at 21:08
  • Following your suggestion, I found [DSPUtil](https://github.com/hughpyle/inguz-DSPUtil/wiki/Overview), which has a convolution algorithm implemented. I used it to convolve the two sounds I have. The result is a sound file or an array of samples (double). Can you please provide practical hints on how I can use this to find the peak you were talking about? – Daniel Hilgarth Apr 30 '11 at 22:08
  • @sehe: I posted a new question, please have a look: http://stackoverflow.com/questions/5847570/use-convolution-to-find-a-reference-audio-sample-in-a-continuous-stream-of-sound – Daniel Hilgarth May 01 '11 at 09:23

First I'd smooth it a bit in the frequency direction so that small variations in frequency become less relevant. Then simply take each frequency and subtract the two amplitudes. Square the differences and add them up. Perhaps normalize the signals first so differences in total amplitude don't matter. Then compare the difference to a threshold.
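The steps above could be sketched like this, assuming NumPy; the smoothing-kernel width and the synthetic test spectra are illustrative assumptions:

```python
import numpy as np

def spectral_distance(spec_a, spec_b, smooth_bins=5):
    """Smooth along the frequency axis, normalize, then return the
    sum of squared differences between two magnitude spectra."""
    kernel = np.ones(smooth_bins) / smooth_bins
    a = np.convolve(spec_a, kernel, mode="same")
    b = np.convolve(spec_b, kernel, mode="same")
    a /= np.linalg.norm(a)   # normalize so total amplitude doesn't matter
    b /= np.linalg.norm(b)
    return np.sum((a - b) ** 2)

freqs = np.arange(256)
ref = np.exp(-((freqs - 64) ** 2) / 20.0)        # reference peak at bin 64
shifted = np.exp(-((freqs - 66) ** 2) / 20.0)    # slightly shifted copy
rng = np.random.default_rng(0)
noise = rng.random(256)

# the shifted copy should score far closer to the reference than noise does
print(spectral_distance(ref, shifted) < spectral_distance(ref, noise))  # True
```

The resulting distance would then be compared against an empirically chosen threshold.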

CodesInChaos
  • Thanks for your answer. I don't think that smoothing only the frequencies for each discrete point in time will help a lot... Did you try something like this? Why square the differences and not just Math.Abs them? – Daniel Hilgarth Apr 30 '11 at 18:34
  • Sound levels are a logarithmic measure. Measuring power in a waveform is the sum of the squared amplitude integrated across the time domain; it's just the physics of sounds/waveforms. You can leave it out, but you'll get biased/skewed comparisons. See again the link to RMS I posted – sehe Apr 30 '11 at 19:17
  • Doing it in the time domain is hard since you don't know at what exact time the start of the sample is. Doing it in frequency space is much more tolerant to small timeshifts. – CodesInChaos Apr 30 '11 at 20:24