
I'm trying to parallelize a dot product and measure its running time on varying numbers of cores using OpenMP. For N=1e9 I get a CPU time of 5.6 seconds on 1 core, 6.0 seconds on 8 cores, and 10.8 seconds on 16 cores. Why does the computation time rise when I use more cores?

Here's my code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

#define DATA_TYPE float

const int N = 1e9;

int main ()
{
  int i, nthreads;
  DATA_TYPE x_par, *y, *z;
  double cput_par;
  clock_t start, end;

  y = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);
  z = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);

  for (i=0; i<N; i++) {
      y[i] = i * 1.0;
      z[i] = i * 2.0;
  }

  x_par = 0;

  //nthreads = omp_get_max_threads();
  nthreads = 1;
  printf("n threads = %d\n", nthreads);
  start=clock();
  omp_set_num_threads(nthreads);
  #pragma omp parallel for reduction(+:x_par)
  for (i=0; i<N; i++)
  {
      x_par += y[i] * z[i];
  }
  end=clock();
  cput_par = ((double)(end-start)/(double)(CLOCKS_PER_SEC));
  printf("Parallel time use: %f\n", cput_par);
  printf("x_par = %f\n", x_par);

  return 0;
}
neckutrek
  • Do you have 16 physical cores, or are you hyperthreading? [Hyperthreading isn't good for performance](https://www.pugetsystems.com/labs/articles/Hyper-Threading-may-be-Killing-your-Parallel-Performance-578/) –  May 25 '15 at 21:52
  • omp_get_max_threads() returns 16 on my machine. My hardware is an Intel Xeon E5520, which has 4 cores and 8 threads. The computation time for using 8 threads instead of 1 still goes up a bit though. Why? – neckutrek May 25 '15 at 22:01
  • Try changing `nthreads` to 4 and see what happens. If your architecture only supports 8 threads, I don't know why `omp_get_max_threads()` would return 16. As for why it slows down for 8 vs. 1, when you run it on 8 threads, you are hyperthreading. That is, you only have 4 physical cores that have two threads each. Anytime you use both threads on a core for the same process, it is called hyperthreading. –  May 25 '15 at 22:04
  • Why don't you try using `omp_get_wtime()` instead of `clock()`? What OS did you do these tests on? – Z boson May 26 '15 at 06:43
  • @R_Kapp, I forgot to mention that my hardware includes two such processors, yielding 8 cores in total. @Z boson, using `omp_get_wtime()` gives me the time "1.07" seconds instead of "6.65" seconds if I use clock() (for N=1e9). Which one of these is right? I'm using ArchLinux. I'm measuring _just outside_ the parallelized part of the program. – neckutrek May 26 '15 at 09:15
  • You are also measuring the thread creation time (since you're timing the first parallel region in the code). That is unlikely to represent your real test case. (Aside from the meta-question of "Why aren't you using an optimized math library, and, ideally a higher level BLAS routine?" There are people who spend all their time tuning these functions!) – Jim Cownie May 26 '15 at 10:59
  • @JimCownie, how can I exclude the thread creation from my wall-time measurement? I don't want to use a BLAS routine for this as my objective is to measure the speedup relative to a strict sequential case, i.e. the same task only parallelized. – neckutrek May 26 '15 at 12:15
  • @MarcusAJohansson, `omp_get_wtime()` returns the wall time. That's what you want. `clock()` does not return the wall time with multiple threads except with the MSFT C libraries on Windows. Is your problem solved now? – Z boson May 26 '15 at 12:50
  • @Zboson, I thought it'd be better to use clock() and then divide that by the number of threads used, otherwise other programs that might be run in parallel on the machine can extend the wall time, no? – neckutrek May 26 '15 at 12:58
  • I don't know. Why don't you ask a question about that? I always use `omp_get_wtime()`. On a system where other users are likely to be running jobs, that may make a difference. According to this [link](https://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results), "the combined CPU time of all threads as returned by clock() is usually more than the wallclock time measured by omp_get_wtime() except if your application mostly sleeps or waits." That seems to disagree with your result, which is why it would be an interesting question. – Z boson May 26 '15 at 13:08
  • @MarcusAJohansson "how can I exclude the thread creation from my wall-time measurement?" Execute a single empty parallel region (to force the runtime to create the thread pool) before you start your timer. – Jim Cownie May 27 '15 at 10:53

1 Answer


The fault was that the total CPU time of all cores/threads used was measured: `clock()` sums the CPU time over every thread. To get the average CPU time per thread, that value has to be divided by the number of threads. Another way to solve it is to measure the walltime, i.e. the difference between the actual time of day before and after the operation. If the walltime is used, however, the operating system might run another program in between, and that time is then also included in the walltime. To illustrate this, along with a comparison against a strictly sequential case, I post this code:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h> //gettimeofday()
#include <time.h>
#include <omp.h>

#define DATA_TYPE float

const int N = 1e9;

int main ()
{
    int i, nthreads;
    DATA_TYPE x_seq, x_par, *y, *z;
    struct timeval time;
    double tstart_cpu, tend_cpu, tstart_wall, tend_wall;
    double walltime_seq, walltime_par, cputime_seq, cputime_par;

    nthreads = 8;
    printf("- - -DOT PROCUCT: OPENMP - - -\n");
    printf("Vector size           : %d\n", N);
    printf("Number of threads used: %d\n", nthreads);


    // INITIALIZATION

    y = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);
    z = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);

    for (i=0; i<N; i++) {
        y[i] = i * 1.0;
        z[i] = i * 2.0;
    }

    x_seq = 0;
    x_par = 0;


    // SEQUENTIAL CASE

    gettimeofday(&time, NULL);
    tstart_cpu = (double)clock()/CLOCKS_PER_SEC;
    tstart_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;

    for (i=0; i<N; i++) x_seq += y[i] * z[i];

    tend_cpu = (double)clock()/CLOCKS_PER_SEC;
    gettimeofday(&time, NULL);
    tend_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;

    cputime_seq = tend_cpu-tstart_cpu;
    walltime_seq = tend_wall - tstart_wall;
    printf("Sequential CPU time: %f\n", cputime_seq);
    printf("Sequential Walltime: %f\n", walltime_seq);
    printf("Sequential result  : %f\n", x_seq);


    // PARALLEL CASE

    gettimeofday(&time, NULL);
    tstart_cpu = (double)clock()/CLOCKS_PER_SEC;
    tstart_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;

    omp_set_num_threads(nthreads);
    #pragma omp parallel for reduction(+:x_par)
    for (i=0; i<N; i++)
    {
        x_par += y[i] * z[i];
    }

    tend_cpu = (double)clock()/CLOCKS_PER_SEC;
    gettimeofday(&time, NULL);
    tend_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;

    cputime_par = tend_cpu - tstart_cpu;
    walltime_par = tend_wall - tstart_wall;
    cputime_par /= nthreads; // take the average cpu time per thread
    printf("Parallel CPU time  : %f\n", cputime_par);
    printf("Parallel Walltime  : %f\n", walltime_par);
    printf("Parallel result    : %f\n", x_par);


    // SPEEDUP

    printf("Speedup (cputime)  : %f\n", cputime_seq/cputime_par);
    printf("Speedup (walltime) : %f\n", walltime_seq/walltime_par);

    return 0;
}

And a typical run of it outputs:

- - -DOT PRODUCT: OPENMP - - -
Vector size           : 1000000000
Number of threads used: 8
Sequential CPU time: 4.871956
Sequential Walltime: 4.878946
Sequential result  : 38685626227668133590597632.000000
Parallel CPU time  : 0.751475
Parallel Walltime  : 0.757933
Parallel result    : 133586303067416523805032448.000000
Speedup (cputime)  : 6.483191
Speedup (walltime) : 6.437172

As you can see, the resulting dot product is not correct: `float` does not have enough precision to accumulate a sum of this magnitude, and the parallel reduction also adds the terms in a different order than the sequential loop, so the two results disagree. That doesn't affect the timing comparison, though, which answers the initial question.
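
As a side note beyond the original answer, one way to address the precision problem is to accumulate in double even though the data stays float. The sketch below uses a hypothetical helper `dot()` that is not part of the code above:

#include <omp.h>

// Sketch: accumulate the reduction in double to keep precision for
// N = 1e9 terms; y and z remain single-precision arrays.
double dot(const float *y, const float *z, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += (double)y[i] * (double)z[i];
    return sum;
}

Note that this still won't reproduce the exact analytic value here, because y[i] = i * 1.0 and z[i] = i * 2.0 are themselves already rounded when stored as float for i >= 2^24.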

neckutrek