study of FFT - Why it's not fast?

Question

I am not sure if it's more math or more programming question. If it's math please tell me.

I know there is a lot of ready to use for free FFT projects. But I try to understand FFT method. Just for fun and for studying it. So I made both algorithms - DFT and FFT, to compare them.

But I have problem with my FFT. It seems there is not big difference in efficiency. My FFT is only little bit faster then DFT (in some cases it's two times faster, but it's max acceleration)

In most articles about FFT, there is something about bit reversal. But I don't see the reason to use bit reversing. Probably it's the case. I don't understand it. Please help me. What I do wrong?

This is my code (you can copy it here and see how it works - online compiler):

#include <complex>
#include <iostream>
#include <math.h>
#include <cmath>
#include <vector>
#include <chrono>
#include <ctime>

float _Pi = 3.14159265;
float sampleRate = 44100;
float resolution = 4;
float _SRrange = sampleRate / resolution; // I devide Sample Rate to make the loop smaller,
                                          //just to perform tests faster
float bufferSize = 512;

// Clock class is for measure time to execute whole loop:
class Clock
{
    public:
        Clock() { start = std::chrono::high_resolution_clock::now(); }
        ~Clock() {}

        float secondsElapsed()
        {
            auto stop = std::chrono::high_resolution_clock::now();
            return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        }
        void reset() { start = std::chrono::high_resolution_clock::now(); }

    private: 
        std::chrono::time_point<std::chrono::high_resolution_clock> start;
};


// Function to calculate magnitude of complex number:
float _mag_Hf(std::complex<float> sf);

// Function to calculate exp(-j*2*PI*n*k / sampleRate) - where "j" is imaginary number:
std::complex<float> _Wnk_Nc(float n, float k);

// Function to calculate exp(-j*2*PI*k / sampleRate):
std::complex<float> _Wk_Nc(float k);



int main() {
  float scaleFFT = 512; // devide and conquere - if it's "1" then whole algorhitm is just simply DFT
            // I wonder what is the maximum of that value. I alvays thought it should be equal to
            // buffer size (number o samples) but above some value it start to work slower then DFT

  std::vector<float> inputSignal; // array of input signal
  inputSignal.resize(bufferSize); // how many sample we will use to calculate Fourier Transform

  std::vector<std::complex<float>> _Sf; // array to store Fourier Transform value for each measured frequency bin
  _Sf.resize(scaleFFT); // resize it to size which we need.

  std::vector<std::complex<float>> _Hf_Db_vect; //array to store magnitude (in logarythmic dB scale)            
                                                //for each measured frequency bin
  _Hf_Db_vect.resize(_SRrange); //resize it to make it able to store value for each measured freq value

  std::complex<float> _Sf_I_half; // complex to calculate first half of freq range
                                  // from 1 to Nyquist  (sampleRate/2)

  std::complex<float> _Sf_II_half; // complex to calculate second half of freq range
                                   //from Nyquist to sampleRate



        for(int i=0; i<(int)_Sf.size(); i++)
            inputSignal[i]  = cosf((float)i/_Pi); // fill the input signal with some data, no matter


  Clock _time; // Start measure time

for(int freqBinK=0; freqBinK < _SRrange/2; freqBinK++) // start calculate all freq (devide by 2 for two halves)
    {
        for(int i=0; i<(int)_Sf.size(); i++) _Sf[i]  = 0.0f; // clean all values, for next loop we need all values to be zero

        for (int n=0; n<bufferSize/_Sf.size(); ++n) // Here I take all samples in buffer
        {
            std::complex<float> _W = _Wnk_Nc(_Sf.size()*(float)n, freqBinK);

            for(int i=0; i<(int)_Sf.size(); i++) // Finally here is my devide and conquer
                _Sf[i]  += inputSignal[_Sf.size()*n  +i] * _W; // And I see no reason to use any bit reversal, how it shoul be????
        }

        std::complex<float> _Wk = _Wk_Nc(freqBinK);

        _Sf_I_half = 0.0f;
        _Sf_II_half = 0.0f;

        for(int z=0; z<(int)_Sf.size()/2; z++) // here I calculate Fourier transform for each freq
        {
            _Sf_I_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z]  + _Wk * _Sf[2*z+1]); // First half - to Nyquist
            _Sf_II_half += _Wk_Nc(2.0f * (float)z *freqBinK) * (_Sf[2*z]  - _Wk * _Sf[2*z+1]); // Second half - to SampleRate
            // also don't see need to use reversal bit, where it shoul be??? :)
        }

        // Calculate magnitude in dB scale
        _Hf_Db_vect[freqBinK] = _mag_Hf(_Sf_I_half); // First half
        _Hf_Db_vect[freqBinK + _SRrange/2] = _mag_Hf(_Sf_II_half); // Second half
    }
  std::cout << _time.secondsElapsed() << std::endl; // time measuer after execution of whole loop
}


float _mag_Hf(std::complex<float> sf)
{
float _Re_2;
float _Im_2;
    _Re_2 = sf.real() * sf.real();
    _Im_2 = sf.imag() * sf.imag();
    return 20*log10(pow(_Re_2 + _Im_2, 0.5f)); //transform magnitude to logarhytmic dB scale
}

std::complex<float> _Wnk_Nc(float n, float k)
{
    std::complex<float> _Wnk_Ncomp;
    _Wnk_Ncomp.real(cosf(-2.0f * _Pi * (float)n * k / sampleRate));
    _Wnk_Ncomp.imag(sinf(-2.0f * _Pi * (float)n * k / sampleRate));
    return _Wnk_Ncomp;
}

std::complex<float> _Wk_Nc(float k)
{
    std::complex<float> _Wk_Ncomp;
    _Wk_Ncomp.real(cosf(-2.0f * _Pi * k / sampleRate));
    _Wk_Ncomp.imag(sinf(-2.0f * _Pi * k / sampleRate));
    return _Wk_Ncomp;
}

I suggest you to learn a bit about how to use a profiler, like Valgrind, and run the application with it. — Jepessen, Mar 12 '18 at 13:52
Make sure you are doing enough work that your measured times will be meaningful and not just noise due to limitations on the accuracy of the clock. Eg measure the time it takes to do 10,000 fft or dft operations in a row, not just one. — Jeremy Friesner, Mar 12 '18 at 13:55
If you need help, you need to make it easy for those that can provide it to read your code. The sloppy and inconsistent formatting accomplishes the opposite. — IInspectable, Mar 12 '18 at 13:56
Profiling will generally tell you what % of time was spent in each function. If you're going to put almost everything in main you're likely going to find out that it spent most of the time in `vector[]` and `main()` which might not help you as much as if you split your program up into functions — UKMonkey, Mar 12 '18 at 13:56
@JeremyFriesner Hey, thanks, yes it’s just example, but normally I use it in Juce, and make FFT and time measure at least every one secont — pajczur, Mar 12 '18 at 14:05
@IInspectable I gave comment for almost each line in the code. I hoped it would help. It looks like I need to learn a lot. I’ve heard one time that there is exist somethink called profiling, but nothing more. I will make that job. I promise. I need more time. Thanks for advice — pajczur, Mar 12 '18 at 14:13
You don't need any tools to properly indent code. Indentation conveys important information. The mere fact, that the code as posted doesn't hurt your eyes, is very alarming. — IInspectable, Mar 12 '18 at 14:16
@UKMonkey I am not sure what you mean? I made it all in main() just for thread, to make it easy to copy fast to any compiler and run. I just want to discus about FFT radix. Normally I have special class for it. :) I am not sure. Do you want to say std::vectors are not good in my case? What would be better. I also though maybe it would be better if i create my own complex, until I use it only to multiplying? — pajczur, Mar 12 '18 at 14:18
If you have your own class - then I'm sure you'll get something of use from it - I was just saying that from your example posted profiling probably wouldn't give you much to work with. — UKMonkey, Mar 12 '18 at 14:25
See What are the rules about using an underscore in a C++ identifier? -- https://stackoverflow.com/questions/228783/what-are-the-rules-about-using-an-underscore-in-a-c-identifier] — Thomas Matthews, Mar 12 '18 at 15:46
Hint: factor common terms or expressions out of `for` loops. Terms that are not dependent on the loop index should be calculated before the loop. — Thomas Matthews, Mar 12 '18 at 15:49
Use temporary variables for common or constant expressions. For some processors, constants are loaded from memory; using temporary variables may tell the compiler to place the values into registers (thus saving time reloading from memory). — Thomas Matthews, Mar 12 '18 at 15:50
Use `std::fill` instead of writing your own loop. The `std::fill` function may call a library function that is optimized for your platform. — Thomas Matthews, Mar 12 '18 at 15:52
Don't place `size()` functions in the limit of your loops. In general, sizes don't change. Calculate once and place into a constant temporary variable. — Thomas Matthews, Mar 12 '18 at 15:53
@ThomasMatthews Did you mean creating more variables in the loops? like in my code `std::complex _W = _Wk_Nc(freqBinK);` ? Did you mean that? — pajczur, Mar 12 '18 at 18:04
You say don't place size() in the limit of loop. But for the matter of testing, it's better solution, because I can change `scaleFFT` in one place and I see how it reacts for the code. Am I right? On the and when I'll be sure my code is OK, than I can change it on constant. — pajczur, Mar 12 '18 at 18:09

Paul R · Answer 1 · 2018-03-12T15:56:52.520

2

One huge mistake you are making is calculating the butterfly weights (which involves sin and cos) on the fly (in _Wnk_Nc()). sin and cos typically cost 10s to 100s of clock cycles, whereas the other butterfly operations are just mul and add, which only take a few cycles, hence the need to factor these out. All fast FFT implementations do this as part of an initialisation step (usually called "plan creation" or similar). See e.g. FFTW and KissFFT.

edited Mar 12 '18 at 15:56

answered Mar 12 '18 at 15:51

Paul R

208,748
37
389
560

Hey great thanks for reply. But as I understand you don't say anything about logic of my code? What you say is very interesting for me also but as I understand is it only about about code optimalisation. But here I am mainly focused on logic of FFT, things like "devide and conquere". Is there everything OK about that? But what about bit reversal? – pajczur Mar 12 '18 at 17:58
And how to make creation part for `_Wnk_Nc` where are to variables `n` and `k`. And both of them could be huge numbers like 44100. So I would need to create array of `complex _Wnk_Nc[pow(44100, 2)]` It would be quite big array. Am I wrong? And on the stage of testing the code it would be a night mare if I want for example change the resolution, buffer size or sample rate. Am I wrong? – pajczur Mar 12 '18 at 18:19
Your question title says: "Why it's not fast?", so I answered that. As for the initialisation, the typical use case for FFTs is for a given N you apply many FFTs of the same size, so you just need to calculate one array of size N weights and re-use it many times. – Paul R Mar 12 '18 at 22:33

score 1 · Answer 2 · answered Mar 15 '18 at 12:32

apart of abovementioned "pre-calculating butterfly weights" optimization, most FFT implementations also use SIMD instructions to vectorize code.

// also don't see need to use reversal bit, where it shoul be?

The very first butterfly loop should be reverse-bit indexed. Those indexes are usually calculated inside recursion, but for loop solution calculating those indexes is also costly, so it's better to pre-calculate them in plan as well.

Combining those optimization approaches result in approximately 100x speedup

hotpaw2 · Answer 3 · 2018-03-15T14:45:59.283

0

Most fast FFT implementations either use a lookup table of precomputed twiddle factors, or a simple recursion to rotate the twiddle factors on the fly, instead of calling trigonometric math library functions inside the FFT inner loop.

For large FFTs, using a trig recursion formula is less likely to thrash the data caches on contemporary processors.

edited Mar 15 '18 at 14:45

answered Mar 15 '18 at 09:34

hotpaw2

70,107
14
90
153

Yes, I’ve heard that. Unfortunately title of my question is wrong. My point is not to optimize code, but to understand FFT calculations at all. That’s why I moved thread to other dsp processing forum: [link]https://dsp.stackexchange.com/questions/47833/fft-second-and-further-divides-and-conquers-need-help[/link] – pajczur Mar 15 '18 at 10:52

study of FFT - Why it's not fast?

3 Answers3