
I have to add two vectors and compare serial performance against parallel performance. However, my parallel code seems to take longer to execute than the serial code.

Could you please suggest changes to make the parallel code faster?

#include <iostream>
#include <time.h>
#include "omp.h"
#define ull unsigned long long

using namespace std;

void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ull i;

    #pragma omp parallel for shared (A,B,C,N) private(i) schedule(static)
    for (i = 0; i < N; ++i)
    {
        C[i] = A[i] + B[i];
    }
}

int main(){
    ull n = 100000000;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[n];

    double time_spent = 0.0;

    for(ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }

    //PARALLEL
    clock_t begin = clock();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;

    cout<<"time elapsed in parallel : "<<time_spent<<endl;

    //SERIAL
    time_spent = 0.0;
    for(ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }
    begin = clock();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout<<"time elapsed in serial : "<<time_spent;
    return 0;
}

These are results:

time elapsed in parallel : 0.824808

time elapsed in serial : 0.351246

I've read on another thread that factors like thread spawning and resource allocation add overhead, but I don't know what to do to get the expected result.


EDIT:

Thanks! @zulan's and @Daniel Langr's answers actually helped!

I used omp_get_wtime() instead of clock(). It turns out that clock() measures the cumulative CPU time of all threads, whereas omp_get_wtime() measures the wall-clock time elapsed between two arbitrary points.

This answer too answers this query pretty well: https://stackoverflow.com/a/10874371/4305675

Here's the fixed code:

void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ....
}

int main(){

    ....


    //PARALLEL
    double begin = omp_get_wtime();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    double end = omp_get_wtime();
    time_spent += (double)(end - begin);

    cout<<"time elapsed in parallel : "<<time_spent<<endl;



    ....

    //SERIAL
    begin = omp_get_wtime();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = omp_get_wtime();
    time_spent += (double)(end - begin);
    cout<<"time elapsed in serial : "<<time_spent;
return 0;
}

RESULT AFTER CHANGES:

time elapsed in parallel : 0.204763

time elapsed in serial : 0.351711

Nitish Prajapati
  • What compilation flags did you use? – NathanOliver Oct 29 '19 at 14:39
  • I am not sure how much optimization the compiler can do in this case, but did you consider using random values instead of the number `1`? – RoQuOTriX Oct 29 '19 at 14:40
  • What is the value of `OMP_NUM_THREADS`? Can you please check the return value of `omp_get_num_threads()`? – Davide Spataro Oct 29 '19 at 14:41
  • I think the compiler will just remove the loops, since you never use `C[i]`. See also [on godbolt](https://godbolt.org/z/BkP6wV). You should try and print all values of `C` at the end. – Lukas-T Oct 29 '19 at 14:54
  • @NathanOliver, while compiling I used `-fopenmp`. This is how I compiled: `g++ -fopenmp 1.cpp -o 1` – Nitish Prajapati Oct 29 '19 at 16:38
  • @DavideSpataro, in which part of the code should I check `OMP_NUM_THREADS`? I checked at the end of the program and it outputs `1`. – Nitish Prajapati Oct 29 '19 at 16:40
  • @NitishPrajapati Okay. Do note that when benchmarking, you should turn on optimizations (`-O2` or `-O3`). The optimizer can do a lot that could make your sequential loop run a lot faster, like using SIMD – NathanOliver Oct 29 '19 at 16:41

1 Answer


There are multiple factors that influence your measurements:

  1. Use omp_get_wtime() as @zulan suggested; otherwise, you may actually measure combined CPU time instead of wall time.

  2. Threading has some overhead and typically does not pay off for short computations. You may want to use a higher n.

  3. "Touch" the data in the C array before running parallelAddition. Otherwise, the memory pages are actually allocated by the OS inside parallelAddition. An easy fix since C++11: double* C = new double[n]{};.

I tried your program with n set to 1G, and the last change reduced the runtime of parallelAddition from 1.54 to 0.94 [s] for 2 threads. The serial version took 1.83 [s]; the speedup with 2 threads was therefore 1.95, which is pretty close to ideal.

Other considerations:

  • Generally, if you profile something, make sure that the program has some observable effect. Otherwise, a compiler may optimize a lot of code away. Your array addition has no observable effect.

  • Add some form of the restrict keyword to the C parameter. Without it, a compiler might not be able to apply vectorization.

  • If you are on a multi-socket system, take care with thread affinity and NUMA effects. On my dual-socket system, the parallel version with 2 threads took 0.94 [s] (as mentioned above) when the threads were restricted to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], thus 1.44 times more.

Daniel Langr