
I am trying to measure the speedup of parallel code for different numbers of threads, where speedup is the ratio of the compute time of the sequential algorithm to the compute time of the parallel algorithm. I am using OpenMP with FFTW in C++, with omp_get_wtime() to measure the parallel time and clock() to measure the sequential time. Originally, I computed the speedup by dividing the parallel time with 1 thread by the parallel time with more threads, since the parallel time at 1 thread should equal the sequential time. However, I noticed that the sequential time changes with the number of threads, and now I am not sure how to compute my speedup.

Example:

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

double start_time, run_time;
int nThreads = 1;

fftw_complex *input_array;
input_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));

// Re and Im hold nx*ny*nz complex values and are defined elsewhere
memcpy(input_array, Re.data(), (nx*ny*nz) * sizeof(fftw_complex));

fftw_complex *output_array;
output_array = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));

start_time = omp_get_wtime();    // wall-clock time
clock_t start_time1 = clock();   // CPU time

fftw_init_threads();
fftw_plan_with_nthreads(nThreads); // or omp_get_max_threads()
fftw_plan forward = fftw_plan_dft_3d(nx, ny, nz, input_array, output_array, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(forward);
fftw_destroy_plan(forward);
fftw_cleanup();

run_time = omp_get_wtime() - start_time;
clock_t end1 = clock();

cout << "Parallel Time in s: " << run_time << "s\n";
cout << "Serial Time in s: " << (double)(end1 - start_time1) / CLOCKS_PER_SEC << "s\n";

memcpy(Im.data(), output_array, (nx*ny*nz) * sizeof(fftw_complex));

fftw_free(input_array);
fftw_free(output_array);

Results of the above code are the following:

For 1 thread:

Parallel Time in s: 0.0231161s
Serial Time in s: 0.023115s

This gives a speedup of ~1, which makes sense.

For 2 threads (with ~ 2x speedup):

Parallel Time in s: 0.0132717s
Serial Time in s: 0.025434s

and so on. So the question is: why does the serial time increase with the number of threads? Or am I supposed to measure the speedup using only omp_get_wtime(), with the 1-thread time treated as my sequential time? I am pretty confused about the speedup/performance of the code above: it is either 5-6 times as fast (equal to the number of cores on my computer) or only about twice as fast, depending on how I calculate the sequential time.
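
For concreteness, here are the two candidate speedup calculations for the 2-thread run above, using the numbers printed by the code (the variable names are just for illustration):

double t_parallel_2thr = 0.0132717;  // omp_get_wtime(), 2 threads
double t_serial_wall   = 0.0231161;  // omp_get_wtime(), 1 thread
double t_serial_cpu    = 0.025434;   // clock(), taken during the 2-thread run

double speedup_vs_wall = t_serial_wall / t_parallel_2thr;  // ~1.74
double speedup_vs_cpu  = t_serial_cpu  / t_parallel_2thr;  // ~1.92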

Jamie
  • A useful discussion here: https://stackoverflow.com/questions/10673732/openmp-time-and-clock-give-two-different-results. The way to measure serial time (I would interpret that to mean the total time for a single thread to perform all those FFTs, one after the other) would be to write a single threaded program and measure the wall clock time (averaged over several runs). Then measure wall clock time using 2, 3, 4... threads to see how they scale. Something like that, anyway. – Paul Sanders Aug 04 '23 at 01:20
  • Actually, it's interesting what your measurements _do_ show. `clock()` attempts to report how many CPU cycles you burned. This increases somewhat with more threads because there are overheads. But if the code is well designed (and the problem at hand is suitable, which yours is), it's a win overall because total wall clock time is reduced. Just spins up the fan a little bit. – Paul Sanders Aug 04 '23 at 01:49
  • @PaulSanders: Usually, you compile once without the compiler flag to enable openmp, then again with it, and compare wall time for the two. – Jerry Coffin Aug 04 '23 at 01:52
  • @JerryCoffin Sounds reasonable. I think the OP might be limiting the number if threads via OMP_NUM_THREADS or somesuch, I guess that's legit. Probably falls into the 'good enough' category. – Paul Sanders Aug 04 '23 at 02:00
  • @PaulSanders: He's using: `fftw_plan_with_nthreads(nThreads);` (and setting `nThreads` to `1`, in the code as posted). – Jerry Coffin Aug 04 '23 at 02:02
  • @JerryCoffin Sorry, so he is. I even saw it in the code, must be losing my marbles. – Paul Sanders Aug 04 '23 at 02:06
  • @PaulSanders: happens to all of us sometimes. – Jerry Coffin Aug 04 '23 at 02:06
  • So, I guess the right way would be to turn off any -fopenmp and -lfftw3_threads flags, which means commenting out `fftw_plan_with_nthreads(nThreads);`, run the code once using `clock()` to find the actual sequential time, and use that for the speedup measurement. – Jamie Aug 04 '23 at 05:03
  • If you want to compare apples with apples, I would measure wall clock time for both your single threaded and multi-threaded tests, averaged over several runs and run on a lightly loaded machine. There are facilities in `std::chrono` to do that (use `steady_clock`); a sketch along these lines follows these comments. But rebuilding makes sense. Some overhead will disappear in the single threaded version if you do that (might not be significant though). Jerry obviously knows what he's talking about so it seems sensible to follow his advice. – Paul Sanders Aug 04 '23 at 05:22
  • You might also find my CPU fun articles on presenting parallel performance useful: https://cpufun.substack.com/p/presenting-parallel-performance-1 – Jim Cownie Aug 04 '23 at 19:26
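
A minimal sketch of the measurement approach suggested in the comments: a single-threaded wall-clock baseline compared against multi-threaded wall-clock times, each averaged over a few runs with `std::chrono::steady_clock`. The helper name `time_fft`, the repetition count, and the thread range are arbitrary choices for illustration, and it assumes FFTW was built with thread support:

#include <chrono>
#include <iostream>
#include <fftw3.h>

static const int nx = 128, ny = 128, nz = 128;
static const int nRuns = 5;   // number of repetitions to average over (arbitrary)

// Average wall-clock time of one forward 3-D FFT with nThreads threads.
// The plan is created once, outside the timed region, and reused.
double time_fft(int nThreads)
{
    fftw_complex *in  = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));
    fftw_complex *out = (fftw_complex*) fftw_malloc((nx*ny*nz) * sizeof(fftw_complex));
    for (int i = 0; i < nx*ny*nz; ++i) { in[i][0] = 1.0; in[i][1] = 0.0; }  // dummy input

    fftw_plan_with_nthreads(nThreads);
    fftw_plan p = fftw_plan_dft_3d(nx, ny, nz, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    double total = 0.0;
    for (int r = 0; r < nRuns; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        fftw_execute(p);
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return total / nRuns;
}

int main()
{
    fftw_init_threads();                      // once, before any plan is created

    const double tSerial = time_fft(1);       // single-threaded baseline
    for (int nThreads = 1; nThreads <= 6; ++nThreads) {
        double t = time_fft(nThreads);
        std::cout << nThreads << " thread(s): " << t
                  << " s, speedup = " << tSerial / t << "\n";
    }

    fftw_cleanup_threads();
    return 0;
}

Only fftw_execute() is timed here, so planning cost and fftw_init_threads() are deliberately excluded; whether that is what you want depends on whether your real application reuses plans. For the stricter apples-to-apples comparison Jerry Coffin describes, the baseline would instead come from a separate build without the FFTW threads library (and without `fftw_init_threads()` / `fftw_plan_with_nthreads()`); treating the 1-thread run above as the baseline is the "good enough" version of that. The threaded build links with -lfftw3_omp (or -lfftw3_threads) in addition to -lfftw3.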

0 Answers