I want to reduce the time to compute 2D FFT in C++ on 100 million complex data

Question

I am trying to compute 2D FFT on 100 million complex data (100000x1000) and it is taking 4.6 seconds approximately, but I want to reduce time. Then I tried to compute it using fftw_thread. But then the computation time has increased (in 2 threads time taken - 8.5 sec an in 4 threads time taken - 16.5 sec). I am using FFTW3 library for C++ and OS - ubuntu 18.04 I am attaching the C++ code below :

#include <iostream>
#include <time.h>
#include <fftw3.h>
using namespace std;
#define ROW 100000
#define COL 1000

int main() {
        fftwf_complex *in = (fftwf_complex *)calloc(ROW*COL,sizeof(fftwf_complex));
        fftwf_complex *out = (fftwf_complex *)calloc(ROW*COL,sizeof(fftwf_complex));

        // generating random data
        for(uint32_t i = 0 ; i < ROW*COL ; i++) {
            in[i][0] = i+1;
            in[i][1] = i+2;
        }
        int thread_number = 2;
        fftwf_plan_with_nthreads(thread_number);
        int h = fftwf_init_threads();
        fftwf_plan p = fftwf_plan_dft_2d(ROW,COL,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
        fftwf_execute(p);
        fftwf_destroy_plan(p);
        fftwf_cleanup_threads();
}

I am getting no error. I want to reduce the execution time. Can anyone please help me in this matter to reduce the time to compute the 2D FFT on 100 million data.

If it takes that long, that's how long it takes. Get a faster computer? Over-clock it more? — tadman, Mar 18 '21 at 06:58
Just to ask the obvious question: You're running a fully optimized build with `-O3` (or equivalent) and not a debug build, correct? — tadman, Mar 18 '21 at 06:58
This is typically the kind of task, where multithreading should be quite efficient. You could provide a [mre], so that we can perform some tests ourself. — Damien, Mar 18 '21 at 07:03
Did you consider padding further on? FFTW is quite sensitive to "odd" sizes. Depends on the context if this approach could be really helpful here. — Secundi, Mar 18 '21 at 08:57
I have attached the entire C++ code. I am not getting any clue why the time has increased if we use multi-threading. Please help me in this matter...Please provide me the code if possible. — Siddhartha Roy, Mar 18 '21 at 10:01
What's the value of `thread_number`? And please answer the optimisation question above. — Richard Critten, Mar 18 '21 at 10:08
If I set thread_number as 2, 4, the time taken by the code is approximately 8.5 seconds and 16.5 seconds respectively. But if run the code using a single thread (thread_number = 1) then the code is taking 4.6 seconds. But I want to achieve it in 1 seconds. — Siddhartha Roy, Mar 18 '21 at 10:11
Just in case you have an NVIDIA GPU you could try [cuFFT](https://docs.nvidia.com/cuda/cufft/index.html). — Claas Bontus, Mar 18 '21 at 13:50

Claas Bontus · Answer 1 · 2021-03-19T13:52:11.780

How did you measure execution time? Note that the actual FFT is done with fftwf_execute. The rest is initialization and cleaning up. See the code below (if you are not on Linux modify time_in_secs to fit to your system). On my computer the code below takes around 10 seconds with one thread, 6 seconds with two threads and around 3.6 seconds with four threads. That's for the FFT part (t3-t2).

#include <iostream>
#include <time.h>
#include <fftw3.h>
#define ROW 100000
#define COL 1000

double
time_in_secs()
{
  struct timespec   t;

  clock_gettime( CLOCK_MONOTONIC /* CLOCK_PROCESS_CPUTIME_ID */, &t );

  return (double)t.tv_sec + 1.0E-09 * (double)t.tv_nsec;
}


int main() {
        fftwf_complex *in = (fftwf_complex *)calloc(ROW*COL,sizeof(fftwf_complex));
        fftwf_complex *out = (fftwf_complex *)calloc(ROW*COL,sizeof(fftwf_complex));

        // generating random data
        for(uint32_t i = 0 ; i < ROW*COL ; i++) {
            in[i][0] = i+1;
            in[i][1] = i+2;
        }
        int thread_number = 6;
        
        double  t1 = time_in_secs();
        
        fftwf_plan_with_nthreads(thread_number);
        int h = fftwf_init_threads();
        fftwf_plan p = fftwf_plan_dft_2d(ROW,COL,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
        
        double  t2 = time_in_secs();
        
        fftwf_execute(p);
        
        double  t3 = time_in_secs();
        
        fftwf_destroy_plan(p);
        fftwf_cleanup_threads();
        
        std::cout << "Time for init: " << t2-t1 << " sec\n";
        std::cout << "Time for FFT:  " << t3-t2 << " sec\n";
        std::cout << "Total time:    " << t3-t1 << " sec\n";
        std::cout << "# threads:     " << thread_number << '\n';
}

Speeding up the initialization process can be done utilizing wisdom as shown below. In the first run of the program the wisdom file will not be found. Computation of the plan takes its time. In successive calls the wisdom will be used for accelerated computation of the plan. Notice that fftwf_init_threads must be called before the wisdom file gets read.

        double  t1 = time_in_secs();
        
        fftwf_plan_with_nthreads(thread_number);
        int h = fftwf_init_threads();
        
        const char * wisdom_file = "fftw_wisdom.dat";
        FILE   *w_file= fopen( wisdom_file, "r" );
        if( w_file )
        {
          int ec = fftwf_import_wisdom_from_file( w_file );
          fclose( w_file );
          std::cout << "Read wisdom file " << ec << '\n';
        }
        else
        {
          std::cout << "No wisdom file found\n";
        }
        
        fftwf_plan p = fftwf_plan_dft_2d(ROW,COL,in,out,FFTW_FORWARD,FFTW_MEASURE);
         
        w_file= fopen( wisdom_file, "w" );
        if( w_file )
        {
          fftwf_export_wisdom_to_file( w_file );
          fclose( w_file );
          std::cout << "Wrote wisdom file\n";
        }
        
        double  t2 = time_in_secs();

Compared to the initial example we have set the planner flag to FFTW_MEASURE. This makes the effect of wisdom storage more pronounced.

I searched on internet and found that execution time can be reduced if we use "wisdom". Is it so ? But I don't understand how to do so and how to use wisdom. — Siddhartha Roy, Mar 18 '21 at 18:20
When creating a plan fftw tries various optimizations. With your choice of `FFTW_ESTIMATE` you selected how complex these trials shall be. See [Planner Flags](http://www.fftw.org/fftw3_doc/Planner-Flags.html). If you choose a larger complexity you wait longer for finishing the plan. Once you have a plan, you can write the parameters to disk. That's wisdom. Next time you can just read the wisdom instead of recomputing the plan. See [the manual](http://www.fftw.org/fftw3_doc/Wisdom.html). Be careful, wisdom strongly depends on the computer. If you change FFT parameters, you need a new plan. — Claas Bontus, Mar 18 '21 at 18:59
Please give me a code in C++ where you use wisdom and then I will understand the wisdom concept more clearly. Because I want to see a practical example of wisdom where parameter is written in disk and wisdom will read it next time. Please give me a code in C++. — Siddhartha Roy, Mar 19 '21 at 05:40
Using wisdom is as easy as writing *hello world!* See above. With this example planning takes about four seconds at my site. Once a wisdom file exists it takes hardly any time. Notice also that you can reuse the plan as long as your program is running. If you have multiple data sets to transform just sequentially call `fftwf_execute` on them. — Claas Bontus, Mar 19 '21 at 13:57

I want to reduce the time to compute 2D FFT in C++ on 100 million complex data

1 Answers1