The fault was that the total CPU time of all cores/threads was being measured: clock() reports the CPU time of the whole process, summed over its threads. To get the average CPU time per thread, that value needs to be divided by the number of threads. Another way to solve it is to measure the wall time (i.e. the difference between the time of day before and after the operation). Note that if wall time is used, the operating system may run other programs in the meantime, and that time is then included in the measurement. To illustrate this, along with a comparison against a strictly sequential case, I post this code:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h> // gettimeofday()
#include <time.h>     // clock()
#include <omp.h>

#define DATA_TYPE float

const int N = 1e9;

int main ()
{
    int i, nthreads;
    DATA_TYPE x_seq, x_par, *y, *z;
    struct timeval time;
    double tstart_cpu, tend_cpu, tstart_wall, tend_wall;
    double walltime_seq, walltime_par, cputime_seq, cputime_par;

    nthreads = 8;
    printf("- - -DOT PROCUCT: OPENMP - - -\n");
    printf("Vector size : %d\n", N);
    printf("Number of threads used: %d\n", nthreads);

    // INITIALIZATION
    y = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);
    z = (DATA_TYPE*)malloc(sizeof(DATA_TYPE)*N);
    if (y == NULL || z == NULL) {
        fprintf(stderr, "Out of memory\n");
        return 1;
    }
    for (i=0; i<N; i++) {
        y[i] = i * 1.0;
        z[i] = i * 2.0;
    }
    x_seq = 0;
    x_par = 0;

    // SEQUENTIAL CASE
    gettimeofday(&time, NULL);
    tstart_cpu = (double)clock()/CLOCKS_PER_SEC;
    tstart_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;
    for (i=0; i<N; i++) x_seq += y[i] * z[i];
    tend_cpu = (double)clock()/CLOCKS_PER_SEC;
    gettimeofday(&time, NULL);
    tend_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;
    cputime_seq = tend_cpu - tstart_cpu;
    walltime_seq = tend_wall - tstart_wall;
    printf("Sequential CPU time: %f\n", cputime_seq);
    printf("Sequential Walltime: %f\n", walltime_seq);
    printf("Sequential result : %f\n", x_seq);

    // PARALLEL CASE
    gettimeofday(&time, NULL);
    tstart_cpu = (double)clock()/CLOCKS_PER_SEC;
    tstart_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;
    omp_set_num_threads(nthreads);
    #pragma omp parallel for reduction(+:x_par)
    for (i=0; i<N; i++)
    {
        x_par += y[i] * z[i];
    }
    tend_cpu = (double)clock()/CLOCKS_PER_SEC;
    gettimeofday(&time, NULL);
    tend_wall = (double)time.tv_sec + (double)time.tv_usec * .000001;
    cputime_par = tend_cpu - tstart_cpu;
    walltime_par = tend_wall - tstart_wall;
    // clock() sums the CPU time of all threads, so take the average per thread
    cputime_par /= nthreads;
    printf("Parallel CPU time : %f\n", cputime_par);
    printf("Parallel Walltime : %f\n", walltime_par);
    printf("Parallel result : %f\n", x_par);

    // SPEEDUP
    printf("Speedup (cputime) : %f\n", cputime_seq/cputime_par);
    printf("Speedup (walltime) : %f\n", walltime_seq/walltime_par);

    free(y);
    free(z);
    return 0;
}
And a typical run of it outputs:
- - -DOT PROCUCT: OPENMP - - -
Vector size : 1000000000
Number of threads used: 8
Sequential CPU time: 4.871956
Sequential Walltime: 4.878946
Sequential result : 38685626227668133590597632.000000
Parallel CPU time : 0.751475
Parallel Walltime : 0.757933
Parallel result : 133586303067416523805032448.000000
Speedup (cputime) : 6.483191
Speedup (walltime) : 6.437172
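As you can see, the two dot products disagree and neither is correct (the true value is roughly 6.7e26): a float accumulator has only 24 bits of mantissa, so once the running sum gets large, the terms being added are mostly lost to rounding. The timings are unaffected by this, though, so it still answers the initial question.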
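For anyone who wants to reuse the timing part, here is a minimal sketch of the same idea split into small helpers. The names wall_seconds, cpu_seconds and dot are my own and not from the program above. It assumes a POSIX system, where clock_gettime(CLOCK_MONOTONIC, ...) is available and clock() counts CPU time over all threads of the process; OpenMP's own omp_get_wtime() would be another option for the wall time. The accumulator is a double here, which is one way to avoid the rounding problem in the result.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds from a monotonic clock (hypothetical helper name). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU seconds used by the whole process so far. On POSIX systems this is
   summed over all threads, so divide by the thread count for an average. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Dot product of two float vectors, accumulated in double precision.
   The pragma is simply ignored when compiled without OpenMP support. */
static double dot(const float *y, const float *z, long n)
{
    double sum = 0.0;
    long i;

    assert(n >= 0);
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += (double)y[i] * (double)z[i];
    return sum;
}
```

To time a call, read wall_seconds() and cpu_seconds() before and after dot(), take the differences, and divide the CPU difference by the number of threads. Compile with something like gcc -O2 -fopenmp to actually get the parallel loop.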