
I got this code from Wikipedia:

#include <stdio.h>
#include <omp.h>

#define N 100

int main(int argc, char *argv[])
{
  float a[N], b[N], c[N];
  int i;
  omp_set_dynamic(0);
  omp_set_num_threads(10); 


  for (i = 0; i < N; i++)
  {
      a[i] = i * 1.0;
      b[i] = i * 2.0;
  }

#pragma omp parallel shared(a, b, c) private(i)
  {
#pragma omp for
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  }
  printf ("%f\n", c[10]);
  return 0;
}

I tried to compile and run it on Ubuntu 11.04 with GCC 4.5 (my configuration: Intel C2D T7500M 2.2 GHz, 2048 MB RAM), and the program ran about twice as slow as the single-threaded version. Why?

Robotex
    `omp_set_num_threads(10)` and "dual core" is not a good combination to begin with (lots of context switches that aren't good for anything). In addition to that, see [this answer to a related question](http://stackoverflow.com/questions/6506987/why-openmp-version-is-slower/6507736#6507736) – Damon Aug 08 '11 at 11:58
  • You may want to have a look at the accepted answer to [this question](http://stackoverflow.com/q/16807766/771663). – Massimiliano Jun 07 '13 at 10:09

4 Answers


Very simple answer: increase N, and set the number of threads equal to the number of processors you have.

For your machine, N = 100 is a very low number. Try something several orders of magnitude higher.

Another question is: how are you measuring the computation time? Usually one measures the wall-clock time of the whole program (e.g. with `time ./a.out`) to get comparable results.

the_nic

The issue you are facing is false sharing: neighbouring elements of `c` written by different threads can land on the same cache line.

Try this: `#pragma omp parallel shared(a, b) private(i, c)`

(Note, though, that a private `c` is discarded when the parallel region ends, so the final `printf` would no longer see the computed values.)
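An alternative that keeps `c` shared, so the result survives the parallel region, is a static schedule: each thread then writes one contiguous chunk, confining any false sharing to chunk boundaries. A sketch (the function name is mine):

```cpp
#include <vector>

// c stays shared, so the result is visible after the parallel region.
// schedule(static) gives each thread one contiguous block of iterations, so
// two threads can only touch the same cache line right at a chunk boundary.
std::vector<float> add_static(const std::vector<float>& a,
                              const std::vector<float>& b) {
    std::vector<float> c(a.size());
#pragma omp parallel for schedule(static)   // ignored without -fopenmp
    for (int i = 0; i < static_cast<int>(a.size()); i++)
        c[i] = a[i] + b[i];
    return c;
}
```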


Run the code below and see the difference.

1.) OpenMP has overhead, so the runtime has to be large enough to outweigh that overhead before you see any benefit.

2.) Don't set the number of threads yourself; in general, the default is fine. However, if your processor has Hyper-Threading, you might get somewhat better performance by setting the number of threads equal to the number of physical cores. With Hyper-Threading, the default number of threads is twice the number of cores. For example, my machine has four cores and the default number of threads is eight. By setting it to four, I get better results in some situations and worse results in others.

3.) There is some false sharing in c but as long as N is large enough (which it needs to be to overcome the overhead) the false sharing will not cause much of a problem. You can play with the chunk size but I don't think it will be helpful.

4.) Cache issues. You have at least four levels of memory (the values are for my system): L1 (32 KB), L2 (256 KB), L3 (12 MB), and main memory (>>12 MB). The benefits of parallelism diminish as you move to higher levels. However, in the example below I set N to 100 million floats, which is 400 million bytes, or about 381 MB, and it is still significantly faster with multiple threads. Try adjusting N and see what happens. For example, try setting N to one of your cache sizes divided by 4 (one float is 4 bytes); since the arrays a and b also need to fit in the cache, you might need to divide the cache size by 12 instead. However, if N is too small, you fight the OpenMP overhead (which is what the code in your question does).

#include <stdio.h>
#include <omp.h>

#define N 100000000

int main(int argc, char *argv[]) {
    // Uses C++ new[], so compile with e.g.: g++ -O2 -fopenmp
    float *a = new float[N];
    float *b = new float[N];
    float *c = new float[N];

    int i;
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = i * 2.0f;
    }

    // Single-threaded baseline
    double dtime = omp_get_wtime();
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f, %f\n", dtime, c[10]);

    // Same loop with OpenMP
    dtime = omp_get_wtime();
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f, %f\n", dtime, c[10]);

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}

I suppose the compiler vectorized the for loop in the non-OpenMP case (using SSE instructions, for example) and cannot do so in the OpenMP variant.

Use `gcc -S` (or `objdump -S`) to view the assembly for the different variants.
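A sketch of that check (the file and function names are mine): generate the assembly for a serial build and an OpenMP build of the same loop, then compare.

```shell
# Write a minimal translation unit containing just the loop.
cat > vecadd.c <<'EOF'
#define N 100
float a[N], b[N], c[N];
void add(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
EOF

gcc -O2 -std=c99 -S -o serial.s vecadd.c        # pragma ignored: plain (possibly vectorized) loop
gcc -O2 -std=c99 -fopenmp -S -o omp.s vecadd.c  # loop body outlined into a worker function
diff serial.s omp.s | head -n 20                 # the two versions differ substantially
```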

Be careful with shared variables in general, because they may need to be synchronized, which can make things very slow. If you can use 'smart' chunks (look at the `schedule` clause), you might reduce the contention, but again:

  • verify the emitted code
  • profile
  • don't underestimate the efficiency of single-threaded code (because of cache locality and lack of context switches)
  • set the number of threads to the number of CPUs (let OpenMP decide this for you!) unless your thread team has a master thread with dedicated tasks, in which case there might be value in allocating ONE extra thread

In all the cases where I tried to apply OpenMP for parallelization, roughly 70% turned out slower. The cases with a definite speedup involve

  • coarse-grained parallelism (your sample is on the fine-grained end of the spectrum)
  • no shared data
sehe