FFT Algorithm: What goes IN/OUT? (re: real-time pitch detection)

Question

I am attempting to extract pitch data from an audio stream. From what I can see, it looks as though FFT is the best algorithm to use.

Rather than digging straight into the math, could someone help me understand what this FFT algorithm does?

Please don't say something obvious like 'FFT extracts frequency data from a raw signal.' I need the next level of detail.

What do I pass in, and what do I get out?

Once I understand the interface clearly, this will help me to understand the implementation.

I take it I need to pass in an audio buffer, I need to tell it how many bytes to use for each computation (say the most recent 1024 bytes from this buffer). and maybe I need to specify the range of pitches I want it to detect. Now it is going to pass back what? An array of frequency bins? What are these?

(Edit:) I have found a C++ algorithm to use (if I can only understand it)

Performous extracts pitch from the microphone. Also the code is open source. Here is a description of what the algorithm does, from the guy that coded it.

PCM input (with buffering)
FFT (1024 samples at a time, remove 200 samples from front of the buffer afterwards)
Reassignment method (against the previous FFT that was 200 samples earlier)
Filtering of peaks (this part could be done much better or even left out)
Combining peaks into sets of harmonics (we call the combination a tone)
Temporal filtering of tones (update the set of tones detected earlier instead of simply using the newly detected ones)
Pick the best vocal tone (frequency limits, weighting, could use the harmonic array also but I don't think we do)

But could someone help me understand how this works? What is it that is getting sent from the FFT to the Reassignment method?

Numerous duplicates, e.g. [Real-time pitch detection using FFT](http://stackoverflow.com/questions/1466968/real-time-pitch-detection-using-fft) — Paul R, Nov 01 '10 at 10:23
Note also that `pitch` is not the same thing as `frequency` - make sure you understand the difference as your solution will be very different if you want to measure one or the other. — Paul R, Nov 01 '10 at 10:29

score 3 · Answer 1 · edited May 23 '17 at 12:23

The FFT is just one building block in the process, and it may not be the best approach for pitch detection. Read up on pitch detection and decide which algo you want to use first (this will depend on what exactly you are trying to measure the pitch of - speech, single musical instrument, other types of sound, etc. Get this right before getting into low level details such as the FFT (some, but not all pitch detection algorithms use the FFT internally).

There are numerous similar questions on SO already, e.g. Real-time pitch detection using FFT and Pitch detection using FFT for trumpet, and there is good overview material on Wikipedia etc - read these and then decide whether you still want to roll your own FFT-based solution or perhaps use an existing library which is suitable for your particular application.

score 2 · Accepted Answer · answered Nov 01 '10 at 10:26

There is an element of choice here. The most straightforward to implement is to do (2^n samples in) complex numbers in, and 2^n complex numbers out, so maybe you should start with that.

In the special case of a DCT(discrete cosine transform), typically what goes in is 2^n samples (often floats), and out go 2^n values, often floats too. DCT is an FFT but that takes only the real values, and analyses the function in terms of cosines.

It is smart (but commonly skipped) to define a struct to handle the complex values. Traditionally FFT's are done in-place, but it works fine if you don't.

It can be useful to instantiate a class that contains a work buffer for the FFT (if you don't want to do the FFT in-place), and reuse that for several FFTs.

but what are these values that come OUT from the DFT/FFT? – P i Nov 01 '10 at 10:52 — P i, Nov 01 '10 at 10:52

score 1 · Answer 3 · answered Dec 16 '10 at 00:47

In goes N samples of PCM (purely real complex numbers). Out comes N bins of frequency domain (each bin corresponding to a 1/N slice of sample rate). Each bin is a complex number. Rather than real and imaginary parts, these values should generally be handled in polar format (absolute value and argument). The absolute value tells the amount of sound near the bin center frequency while the argument tells the phase (at which position the sine wave is travelling).

Most often coders only use the magnitude (absolute value) and throw away the phase angle (argument).

FFT Algorithm: What goes IN/OUT? (re: real-time pitch detection)

3 Answers3