
I am having trouble getting a simple SAXPY program to scale its performance decently using OpenMP.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv){
    int N = atoi(argv[1]), threads = atoi(argv[2]), i;
    omp_set_num_threads(threads);
    double a = 3.141592, *x, *y, t1, t2;
    x = (double*)malloc(sizeof(double)*N);
    y = (double*)malloc(sizeof(double)*N);

    for(i = 0; i < N; ++i){
        x[i] = y[i] = (double)i;
    }

    t1 = omp_get_wtime();
    #pragma omp parallel for default(none) private(i) shared(a, N, x,y)
    for(i = 0; i < N; ++i){
        y[i] = a*x[i] + y[i];
    }
    t2 = omp_get_wtime();

    printf("%f secs\n", t2-t1);
}

I am compiling as:

gcc main.c -lm -O3 -fopenmp -o prog

And the performance I get for 10M elements is:

threads = 1  0.015097 secs
threads = 2  0.013954 secs

Any idea what is the problem I am having?

  • On which architecture are you running? – simpel01 Nov 25 '15 at 19:23
  • Tried on an Intel i7 4700HQ (quad-core laptop) as well as on a 16-core Intel Xeon machine (two sockets of 8 cores). Can you compile, test and tell your speedup? – Cristobal Navarro Nov 25 '15 at 20:22
  • After doing some diagnostics, I have found that if I put another loop inside, repeating the instruction 1000 times, then I get close to linear speedup. Does someone know if the original SAXPY is known to scale badly on multi-core CPUs? – Cristobal Navarro Nov 25 '15 at 21:15
  • The problem is that the algorithm is memory-bound. There is not enough computation for each element fetched from memory, therefore the threads are starving for data. – simpel01 Nov 26 '15 at 07:07
  • See [this question](http://stackoverflow.com/q/11576670/5239503) to understand why you will or won't see any performance improvement for your parallelisation, depending on the machine you run it on. – Gilles Nov 26 '15 at 08:18
  • Thanks Gilles, it was a very useful read. – Cristobal Navarro Nov 27 '15 at 20:34

1 Answer


You forgot the for in your #pragma omp directive:

#pragma omp parallel for default(none) private(i) shared(a, N, x,y)

Without the for there is no work-sharing: each thread is going to iterate over the full range [0, N).

simpel01
  • I am checking this, give me a second – Cristobal Navarro Nov 25 '15 at 20:53
  • It was a typo in the question, not in the code, sorry for that. I have fixed the post now. The performance values are updated because, thanks to your post, I found a wrongly placed timer call. Still, the problem persists, with very small speedup. – Cristobal Navarro Nov 25 '15 at 20:59
  • Well, your total execution time is too small. Consider that spawning threads is an expensive operation and you need enough computation to compensate for it. If I compile the program without `-O3` I get linear speedup, so I don't think there is anything wrong with the way you parallelized it. – simpel01 Nov 25 '15 at 21:10
  • You are right, I am also getting linear speedup when -O3 is removed. Otherwise, there is too little work to notice a speedup, I guess. Many thanks – Cristobal Navarro Nov 25 '15 at 21:24