
I am trying to build a graphical audio spectrum analyzer on Linux. I run an FFT function on each buffer of PCM samples/frames fed to the audio hardware so I can see which frequencies are the most prevalent in the audio output. Everything works, except the results from the FFT function only allocate a few array elements (bins) to the lower and mid frequencies. I understand that pitch is perceived logarithmically, while the FFT spaces its bins linearly. But with so few bins allocated to the low/mid frequencies, I'm not sure how I can separate things cleanly to show the frequency distribution graphically. I have tried window sizes of 256 up to 1024 bytes, and while the larger windows give more resolution in the low/mid range, it's still not that much. I am also applying a Hann function to each chunk of data to smooth out the window boundaries.

For example, I test using a mono audio file that plays tones at 120, 440, 1000, 5000, 15000 and 20000 Hz. These should be somewhat evenly distributed throughout the spectrum when interpreting them logarithmically. However, since FFTW works linearly, with a 256 element or 1024 element array only about 10% of the return array actually holds values up to about 5 kHz. The remainder of the array from FFTW contains frequencies above 10-15 kHz.

Here's roughly the result I'm after:

[image: the desired spectrum display]

But this is what I'm actually getting:

[image: the current spectrum display]

Again, I understand this is probably working as designed, but I still need a way to get more resolution in the bottom and mids so I can separate the frequencies better.

What can I do to make this work?

Synthetix
  • Why not use a supersampled window? For example, if your PCM buffer for the FFT has 1024 samples, you can supersample it to any multiple of 1024, such as 65536, and run the FFT on that. Either interpolate or just copy the missing values from known neighbors. This way the audio latency will not increase, but the resolution of the lower frequencies in the FFT will. By the way, look at [plotting real time Data on (qwt )Oscillocope](https://stackoverflow.com/a/21658139/2521214) for some inspiration. – Spektre Feb 18 '22 at 07:35
  • @Spektre Brilliant idea! That hadn't occurred to me. I'll definitely give that a try. – Synthetix Feb 19 '22 at 07:33

1 Answer


What you are seeing is indeed the expected outcome of an FFT (Fast Fourier Transform): its bins are spaced linearly in frequency. The logarithmic frequency axis that you're expecting is what the Constant-Q transform provides.

Now, the implementation of the Constant-Q transform is non-trivial. The Fourier Transform has become popular precisely because there is a fast implementation (the FFT). In practice, the constant-Q transform is often implemented by using an FFT, and combining multiple high-frequency bins. This discards resolution in the higher bins; it doesn't give you more resolution in the lower bins.
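As a rough illustration of the bin-combining idea, here is a minimal Python sketch. It assumes you already have the power of each bin from a real FFT; the function names `log_band_edges` and `combine_bins` are made up for this example and are not part of FFTW or any other library. This only approximates a logarithmic axis by summing bins; it is not a true constant-Q transform.

```python
def log_band_edges(f_lo, f_hi, n_bands):
    """Return n_bands + 1 geometrically spaced frequencies from f_lo to f_hi."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [f_lo * ratio ** i for i in range(n_bands + 1)]


def combine_bins(power, sample_rate, fft_size, edges):
    """Sum the power of every FFT bin whose centre frequency lies in each band.

    `power` holds fft_size // 2 + 1 values (bins 0..Nyquist of a real FFT).
    A band narrower than one bin may stay empty (0.0): combining bins
    cannot create resolution that the FFT did not have.
    """
    bin_width = sample_rate / fft_size
    bands = [0.0] * (len(edges) - 1)
    for k, p in enumerate(power):
        freq = k * bin_width
        for b in range(len(bands)):
            if edges[b] <= freq < edges[b + 1]:
                bands[b] += p
                break
    return bands


# Hypothetical input: 512-point FFT at 48 kHz with energy in two bins only.
edges = log_band_edges(100.0, 24000.0, 8)
power = [0.0] * 257
power[5] = 1.0      # 5 * 93.75 Hz = 468.75 Hz
power[160] = 1.0    # 160 * 93.75 Hz = 15 kHz
bands = combine_bins(power, 48000, 512, edges)
```

With these assumed numbers the lowest band (100 Hz to roughly 198 Hz) contains a single bin, while the top band (roughly 12 to 24 kHz) sums well over a hundred, which is exactly the imbalance described above.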

To get more frequency resolution in the lower bins of the FFT, just use a longer window. But if you also want to keep the time resolution, you'll have to use a hop size that's smaller than the window size. In other words, your FFT windows will overlap.
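A minimal Python sketch of overlapping windows follows. It assumes the analyzer keeps its own history of recent samples, independent of the playback buffer size; `hann` and `overlapping_frames` are illustrative names, not library functions. Each yielded frame would then be passed to whatever FFT routine you use.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlapping_frames(samples, window_size, hop_size):
    """Yield Hann-windowed frames of window_size samples.

    Consecutive frames start hop_size samples apart, so with
    hop_size < window_size the frames overlap.
    """
    w = hann(window_size)
    for start in range(0, len(samples) - window_size + 1, hop_size):
        frame = samples[start:start + window_size]
        yield [s * c for s, c in zip(frame, w)]


# Hypothetical numbers: a 2048-sample window advanced by 512 samples.
frames = list(overlapping_frames([1.0] * 4096, 2048, 512))
```

At 48 kHz, a 2048-sample window gives bins about 23.4 Hz apart, and a 512-sample hop still produces a new spectrum roughly every 11 ms. The trade-off is that each spectrum describes the last ~43 ms of audio, so the display reacts more slowly than it updates.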

MSalters
  • Thanks. Another problem is I need to keep the audio buffers (and therefore the FFT windows) small so there isn't too much latency in the audio output. The software has some interactive elements to it, so I need to keep the buffers in the sub-15ms range (assuming a 48kHz output rate). – Synthetix Feb 17 '22 at 13:18
  • Well, that puts your FFT sizes at <=15*48 = 720 samples; the obvious power of 2 would be 512 samples (about 11 ms). At a 48 kHz sample rate (Nyquist = 24 kHz), your FFT bins are then about 94 Hz apart (48000/512). – MSalters Feb 17 '22 at 13:32
  • Then assuming my bins are about 94 Hz apart, would it be crazy to simply "cherry pick" the appropriate bins to plot on the analyzer? – Synthetix Feb 17 '22 at 15:24
  • @Synthetix: It would certainly not be crazy. Is it the best choice? That depends on the intended goal. Another common choice is to sum adjacent bins for higher frequencies. You have 257 bins from 512 samples, each about 94 Hz wide, with bin 0 (DC) covering only the very lowest frequencies, up to about 47 Hz. It makes perfect sense to sum the last 128 bins into 8 bands of 16 bins each. This represents the highest octave, 12-24 kHz. Each of these 8 bands is thus 1500 Hz wide. One octave lower (6-12 kHz) you'd sum 64 bins into 8 bands, each 750 Hz wide. Similarly, you'd create bands of 375, 187.5, and 93.75 Hz for the next octaves. – MSalters Feb 17 '22 at 15:58
  • Thanks so much! This is great info. I'm going to give this a try and report back! – Synthetix Feb 17 '22 at 16:29
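
The octave-band layout described in the comments can be checked with a few lines of Python. This is only a sketch under the thread's assumptions (48 kHz sample rate, 512-sample FFT, 8 bands per octave); `octave_bands` is a hypothetical helper name, and bin indices are inclusive.

```python
def octave_bands(sample_rate, fft_size, bands_per_octave=8, octaves=5):
    """Split the top octaves of a real FFT into equal-width bands.

    Returns (first_bin, last_bin, width_hz) tuples with inclusive bin
    indices, ordered from low to high frequency. Stops early once an
    octave has fewer bins than bands_per_octave.
    """
    bin_width = sample_rate / fft_size
    top = fft_size // 2                      # index of the Nyquist bin
    layout = []
    for _ in range(octaves):
        bottom = top // 2
        per_band = (top - bottom) // bands_per_octave
        if per_band < 1:
            break
        octave = [
            (bottom + 1 + i * per_band,      # first bin of band i
             bottom + (i + 1) * per_band,    # last bin of band i
             per_band * bin_width)           # band width in Hz
            for i in range(bands_per_octave)
        ]
        layout = octave + layout             # lower octaves go in front
        top = bottom
    return layout


layout = octave_bands(48000, 512)
```

With these numbers the bin spacing is 48000/512 = 93.75 Hz, and the band widths come out as 1500, 750, 375, 187.5 and 93.75 Hz from the top octave down. Everything below bin 9 (about 750 Hz and under) is left as individual bins, which is the low-frequency limit the question runs into.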