One approach I've found to be helpful is to generate two reference waves 90 degrees apart (I call them "sine" and "cosine") and take the dot product of the input waveform with those reference waves over some fairly short (say 1/60 second) stretches of the input. That will give you a somewhat noisy indicator of how much of the input frequency you have that's in phase or out of phase with regard to your reference waves (the square root of the sum of the squares of the values generated using the two reference waves will be the amplitude). With a small window size, you'll notice that the output is rather noisy, but if you filter the output with something like a simple FIR or IIR filter you should probably get something pretty reasonable.
One nice trick is to generate two amplitude numbers: for the first one, run the sine and cosine amplitudes through two rounds of filtering, then compute the sum of the squares. For the second, run the amplitudes through one round of filtering, then compute the sum of the squares, and then run that through another round of filtering.
Both amplitude measurements will experience the same delay, but the first one will be much more selective than the second; you can thus tell very clearly whether a frequency is 'right on' or is a bit off. Using this approach, it's possible to detect DTMF tones quickly while rejecting tones that are even a few Hz off (off-pitch tones will pick up much more strongly on the 'loose' detector than the tight one).
Sample code:
double sine_phase,sine_freq;
void process_some_waves(double *input, int16 len,
double *sine_phase, double sine_freq,
double *sine_result, double *cosine_result)
{
int i;
double phase, sin_tot,cos_tot;
phase = *sine_phase;
sin_tot = cos_tot = 0;
for (i=0; len > i; i++)
{
sin_tot += input[i] * sin(phase);
cos_tot += input[i] * cos(phase);
phase += sine_freq;
}
*sine_result = sin_tot;
*cosine_result = cos_tot;
*sine_phase = phase;
}
/* Takes first element in buffer and 'smears' it through buffer with simple Gaussian resp. */
void simple_fir_filter(double *buff, int buffsize)
{
int i;
for (i=buffsize-1; i>=2; i--)
buff[i] = (buff[i-1] + buff[i-2])/2;
}
#define FILTER_SIZE1 10
#define FILTER_SIZE2 8
#define SECTION_LENGTH 128
#define FREQ whatever
double sine_buff1[FILTER_SIZE1], sine_buff2[FILTER_SIZE2];
double cos_buff1[FILTER_SIZE1], cos_buff2[FILTER_SIZE2];
double combined_buff[FILTER_SIZE2];
double tight_amplitude, loose_amplitude;
double ref_phase;
void handle_some_data(double *input)
{
/* Put results in first element of filter buffers */
process_some_waves(input, SECTION_LENGTH, &ref_phase, FREQ, sine_buff1, cos_buff1);
/* Run first stage of filtering */
simple_fir_filter(sine_buff1, FILTER_SIZE1);
simple_fir_filter(cosine_buff1, FILTER_SIZE1);
/* Last element of each array will hold results of filtering. */
/* Now do second stage */
sine_buff2[0] = sine_buff1[FILTER_SIZE1-1];
cosine_buff2[0] = cosine_buff1[FILTER_SIZE1-1];
combined_buff[0] = sine_buff2[0]*sine_buff2[0] + cosine_buff2[0]*cosine_buff2[0];
simple_fir_filter(sine_buff2, FILTER_SIZE2);
simple_fir_filter(cosine_buff2, FILTER_SIZE2);
simple_fir_filter(combined_buff, FILTER_SIZE2);
tight_amplitude = sine_buff2[FILTER_SIZE2-1]*sine_buff2[FILTER_SIZE2-1] +
cosine_buff2[FILTER_SIZE2-1]*cosine_buff2[FILTER_SIZE2-1];
loose_amplitude = combined_buff2[FILTER_SIZE2-1];
}
The code here uses 'double' for all math other than array subscripting. In practice, it would almost certainly be faster to replace some of the math with integer maths. On machines with floating point, I would expect the best approach would be to keep the phase as a 32-bit integer and use a table of ~4096 'single' sine values (the smaller the table size in RAM, the better the cache coherency performance). I used code very much like the above on a fixed-point (integer) DSP with great success; the sine and cosine computations in process_some_waves were done in separate "loops", with each "loop" being realized as a single instruction with a "repeat" prefix.