Why does calculation with OpenMP take 100x more time than with a single thread?

Question

I am trying to test Pi calculation problem with OpenMP. I have this code:

#pragma omp parallel private(i, x, y, myid) shared(n) reduction(+:numIn) num_threads(NUM_THREADS)
{
printf("Thread ID is: %d\n", omp_get_thread_num());
myid = omp_get_thread_num();
printf("Thread myid is: %d\n", myid);

  for(i = myid*(n/NUM_THREADS); i < (myid+1)*(n/NUM_THREADS); i++) {
//for(i = 0; i < n; i++) {

    x = (double)rand()/RAND_MAX;

    y = (double)rand()/RAND_MAX;

    if (x*x + y*y <= 1) numIn++;

  }
printf("Thread ID is: %d\n", omp_get_thread_num());

}

  return 4. * numIn / n;

}

When I compile with gcc -fopenmp pi.c -o hello_pi and run it time ./hello_pi for n = 1000000000 I get

real 8m51.595s

user 4m14.004s

sys 60m59.533s

When I run it on with a single thread I get

real 0m20.943s

user 0m20.881s

sys 0m0.000s

Am I missing something? It should be faster with 8 threads. I have 8-core CPU.

Might the rand() be the bottleneck due to bad multithreading performance? — urzeit, Oct 21 '13 at 08:54
You shouldn't handle the repartition of the work yourself, use `#pragma omp for` instead. — Michael M., Oct 21 '13 at 09:08
Before looking at your code why don't you give the results with optimization on. Use `-O3` or at least `-O2`. — Z boson, Oct 22 '13 at 10:46

score 1 · Answer 1 · answered Oct 21 '13 at 09:14

Please take a look at the http://people.sc.fsu.edu/~jburkardt/c_src/openmp/compute_pi.c This might be a good implementation for pi computing.

It is quite important to know that how your data spread to different threads and how the openmp collect them back. Usually, a bad design (which has data dependencies across threads) running on multiple thread will result in a slower execution than a single thread .

kangshiyin · Answer 2 · 2013-10-21T09:34:40.927

1

rand() in stdlib.h is not thread-safe. Using it in multi-thread environment causes a race condition on its hidden state variables, thus lead to poor performance.

http://man7.org/linux/man-pages/man3/rand.3.html

In fact the following code work well as an OpenMP demo.

$ gc -fopenmp -o pi pi.c -O3; time ./pi
pi: 3.141672

real    0m4.957s
user    0m39.417s
sys 0m0.005s

code:

#include <stdio.h>
#include <omp.h>

int main()
{
    const int n=50000;
    const int NUM_THREADS=8;
    int numIn=0;

    #pragma omp parallel for reduction(+:numIn) num_threads(NUM_THREADS)
    for(int i = 0; i < n; i++) {
        double x = (double)i/n;
        for(int j=0;j<n; j++) {
            double y = (double)j/n;
            if (x*x + y*y <= 1) numIn++;
        }
    }

    printf("pi: %f\n",4.*numIn/n/n);
    return 0;
}

edited Oct 21 '13 at 09:34

answered Oct 21 '13 at 09:14

kangshiyin

9,681
1
17
29

Well, the link you provided states opposite: both `rand()` and `rand_r` are multi-thread safe. I think MT-safe should mean here "does not make a race with corruption" - rand() locks the global variable, thus slower perfomance. `rand_r` is also reentrant that means full re-execution safety - it doesn't use any global counter/storage, so suddenly reentering your function from the beginning you can obtain the same results (not like with rand()). – John_West Aug 18 '15 at 21:21
@John_West the link was provided 2 years before your comment. The content probably had changed already... – kangshiyin Apr 08 '17 at 08:28

score 1 · Answer 3 · edited May 23 '17 at 11:54

In general I would not compare times without optimization on. Compile with something like

gcc -O3 -Wall -pedantic -fopenmp main.c

The rand() function is not thread safe in Linux (but it's fine with MSVC and I guess mingw32 which uses the same C run-time libraries, MSVCRT, as MSVC). You can use rand_r with a different seed for each thread. See openmp-program-is-slower-than-sequential-one.

In general try to avoid defining the chunk sizes when you parallelize a loop. Just use #pragma omp for schedule(shared). You also don't need to specify that the loop variable in a parallelized loop is private (the variable i in your code).

Try the following code

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i, numIn, n;
    unsigned int seed;
    double x, y, pi;

    n = 1000000;
    numIn = 0;

    #pragma omp parallel private(seed, x, y) reduction(+:numIn) 
    {
        seed = 25234 + 17 * omp_get_thread_num();
        #pragma omp for
        for (i = 0; i <= n; i++) {
            x = (double)rand_r(&seed) / RAND_MAX;
            y = (double)rand_r(&seed) / RAND_MAX;
            if (x*x + y*y <= 1) numIn++;
        }
    }
    pi = 4.*numIn / n;
    printf("asdf pi %f\n", pi);
    return 0;
}

You can find a working example of this code here http://coliru.stacked-crooked.com/a/9adf1e856fc2b60d

Why does calculation with OpenMP take 100x more time than with a single thread?

3 Answers3

Linked