Why the multithreading FFTW is slower than the single thread FFTW?

Question

I researched the usage of FFTW for a long time, and in my program, the multithreading FFTW with fftw_init_threads() shows slower performance than the single thread version, so, I want to ask for help, please! The following is my program, can you find the problem:

int n1 = 1024, n2 = 1024;
//int nthreads = omp_get_num_procs();

double *x = (double*) fftw_malloc(sizeof(double) * n1 * n2);
fftw_complex *y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n1 * (n2 / 2 + 1));
double *z = (double*) fftw_malloc(sizeof(double) * n1 * n2);
int i, j;
clock_t start, end;
double cpu_time_used;

for (i = 0; i < n1; i++) {
    for (j = 0; j < n2; j++) {
        x[i * n2 + j] = rand() / (double) RAND_MAX;
    }
}


fftw_init_threads();
fftw_plan_with_nthreads(1);
//omp_set_num_threads(2);
//omp_init();

r2c_plan = fftw_plan_dft_r2c_2d(n1, n2, x, y, FFTW_ESTIMATE);
c2r_plan = fftw_plan_dft_c2r_2d(n1, n2, y, z, FFTW_ESTIMATE);

start = clock();

//#pragma omp parallel for
for(i = 0; i < 1000; i++){

    fftw_execute(r2c_plan);

    fftw_execute(c2r_plan);
}

end = clock();

cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

printf(cpu_time_used);

fftw_free(x);
fftw_free(y);
fftw_free(z);
fftw_destroy_plan(r2c_plan);
fftw_destroy_plan(c2r_plan);
fftw_cleanup_threads();

return 0;

When changing the variable of the function "_plan_with_nthreads(int nthreads), does the performance of the program improve compared to the single-thread version? Does that mean the velocity is faster?

How big are the FFT's you are doing? The [documentation](https://www.fftw.org/fftw3_doc/How-Many-Threads-to-Use_003f.html) says *"There is a fair amount of overhead involved in synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have."* and *"Typically, the problem will have to involve at least a few thousand data points before threads become beneficial. If you plan with FFTW_PATIENT, it will automatically disable threads for sizes that don’t benefit from parallelization. "* — Kaz, Jun 25 '23 at 02:28
Thanks for your answer, my matrix is 1024*1024, and i do the forward fft and backward fft as one loop, at the same time, i calculate the time of 1000 loops; by the way, i conduct the same transform with pyfftw, the result shows that multithreading fft is faster than the single one(pyfftw uses the FFTW in its buttom) — lwt, Jun 25 '23 at 02:45
i don't know where the problem is, the phase of installing, or the phase of compiling of c file to exe file? — lwt, Jun 25 '23 at 02:47
Please don’t post images of code, copy-paste the code into your question. https://meta.stackoverflow.com/q/285551/7328782 — Cris Luengo, Jun 25 '23 at 02:51
If you are doing a multi-dimensional FFT, or at least an FFT over many rows of a multi-dimensional array, you want to use a single-threaded FFT, and split your rows across the threads. The higher up in the algorithm the multi-threading happens, the more efficient it will be (ie you always want the outer loop to do the splitting over threads, rather than one of the inner loops). — Cris Luengo, Jun 25 '23 at 02:54
sorry for that, i use the questioning function for the first time, and modify the appearance at once! — lwt, Jun 25 '23 at 02:55
Of course, this will require you to write the loop over the rows, and call FFTW for each 1D row/column. It’s more work, but it will be faster. — Cris Luengo, Jun 25 '23 at 02:57
@Cris Luengo Thanks for your answer, i understand you, and i want to give further exlanation of my program: the loop above is just for calculating the time, because single forward and backward fft time is so short, so i calculate the 1000 loops' time for explicit comparison. — lwt, Jun 25 '23 at 03:02
Your system (you didn't disclose which architecture you are using) is likely very fast at arithmetic compared to the speed at which it can read and write system memory. Multiple threads may be competing to retrieve and store data from system memory. Adding threads doesn't give you more address lines or a wider bus. — Wyck, Jun 25 '23 at 03:16
@Wyck Thanks for your answer, and i wonder why the pyfftw can work faster in the multithreading mode than the single thread, the pyfftw use the fftw at the bottom. Why the c FFTW can't show the same perforamnce? — lwt, Jun 25 '23 at 03:28
Looking at your code, you do a 2D FFT. This means you apply 1D FFT to each row of the image, then again to each column. You do `n1` FFTs of length `n2`, and `n2` FFTs of length `n1`. I don’t know how FFTW applies multithreading in this case, but it likely does each FFT multi-threaded. If so, then writing your own multi-threaded loop over rows and over columns, and calling FFTW for each of the 1D FFTs will be faster. — Cris Luengo, Jun 25 '23 at 04:18
@CrisLuengo It's a fair question though: why doesn't the library itself do that inside `fftw_plan_dft_r2c_2d`? — Kaz, Jun 25 '23 at 04:26
I don't know if it does it or not. I'm thinking it doesn't because you say it's slower when multi-threaded. I don't know. — Cris Luengo, Jun 25 '23 at 04:29
`clock` is a wrong function to measure the speed of a multithreaded program. — n. m. could be an AI, Jun 25 '23 at 05:54
@n.m.will see y'all on Reddit which function should i take to measure the speed of multithreaded program? — lwt, Jun 25 '23 at 06:15
@n.m. will see y'all on Reddit Linux in the VMware, and compiler is gcc — lwt, Jun 25 '23 at 06:30
https://stackoverflow.com/questions/6749621/how-to-create-a-high-resolution-timer-in-linux-to-measure-program-performance (use with `CLOCK_MONOTONIC`, which counts wall time, not CPU time). — Cris Luengo, Jun 25 '23 at 06:52
Related [one](https://stackoverflow.com/questions/11187056/multithreading-taking-equal-time-as-single-thread-quick-sorting/11187702#11187702) [two](https://stackoverflow.com/questions/35592502/clock-function-in-c-with-threads). — n. m. could be an AI, Jun 25 '23 at 07:04
@n.m. will see y'all on Reddit Thank you for the answers, i go to have a look now! — lwt, Jun 25 '23 at 07:18

Why the multithreading FFTW is slower than the single thread FFTW?

0 Answers0