
I have the following code that I want to parallelize:

int ncip( int dim, double R)
{   
    int i;
    int r = (int)floor(R);
    if (dim == 1)
    {   
        return 1 + 2*r; 
    }
    int n = ncip(dim-1, R); // last coord 0

    #pragma omp parallel for
    for(i=1; i<=r; ++i)
    {   
        n += 2*ncip(dim-1, sqrt(R*R - i*i) ); // last coord +- i
    }

    return n;
}

The program's execution time when run without OpenMP is 6.956 s. When I try to parallelize the for loop, the execution time is greater than 3 minutes (and that's only because I killed the process myself). What am I doing wrong in parallelizing this code?

Second attempt:

int ncip( int dim, double R)
{
    int i;
    int r = (int)floor(R);
    if (dim == 1)
    {
        return 1 + 2*r;
    }

    #pragma omp parallel
    {
        int n = ncip(dim-1, R); // last coord 0
        #pragma omp for reduction (+:n)
        for (i=1; i<=r; ++i)
        {
            n += 2*ncip(dim-1, sqrt(R*R - i*i)); // last coord +- i
        }
    }

    return n;
}
Mutating Algorithm

1 Answer


You are doing that wrong!

(1) There are data races on the variable `n`. If you want to parallelize code in which several threads write to the same memory location, you must use a `reduction` (on the for), `atomic`, or `critical` to avoid data hazards.

(2) You probably have nested parallelism enabled, so the program creates a new parallel region every time you call the function `ncip`. This is likely the main problem. For recursive functions I advise you to create just one parallel region and then use `#pragma omp task`.

Do not parallelize with `#pragma omp for`; try `#pragma omp task` instead. Look at this example:

int ncip(int dim, double R){
    ...
    #pragma omp task
    ncip(XX, XX);

    #pragma omp taskwait
    ...
}

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        #pragma omp single 
        ncip(XX, XX);
    } 
    return(0); 
}

UPDATE:

//Detailed version (without omp for and data races)
int ncip(int dim, double R){
    int n, r = (int)floor(R);

    if (dim == 1) return 1 + 2*r; 

    n = ncip(dim-1, R); // last coord 0

    for(int i=1; i<=r; ++i){   
        #pragma omp task shared(n) // n is function-local: without shared(n) it is firstprivate in the task and the updates are lost
        {
            int aux = 2*ncip(dim-1, sqrt(R*R - i*i) ); // last coord +- i

            #pragma omp atomic
            n += aux;
        }
    }
    #pragma omp taskwait
    return n;
}

PS: You'll not get a speedup from this, because the overhead of creating a task is bigger than the work of a single task. The best thing you can do is rewrite this algorithm as an iterative version, and then try to parallelize that.

  • You probably also need `omp single` after `omp parallel` so that only one thread spawns tasks – Piotr Skotnicki May 24 '16 at 18:18
  • Yes, you are right, I forgot it to write in this example. – Hélder Gonçalves May 24 '16 at 18:36
  • @HélderGonçalves The reduce directive goes inside the for loop ? – Mutating Algorithm May 24 '16 at 18:41
  • Yes. As you can see here: https://software.intel.com/en-us/node/524521 PS: It is **reduction** and not 'reduce'. My bad. – Hélder Gonçalves May 24 '16 at 18:45
  • @HélderGonçalves I have updated my question with my second attempt, I am still getting errors. I'm very confused. – Mutating Algorithm May 24 '16 at 18:55
  • You're still spawning a new parallel region every time you call a recursive algorithm, thereby oversubscribing your system and decreasing performance dramatically. And your variable `i` must be private as well. You should be using tasks to perform this like @HélderGonçalves said. – NoseKnowsAll May 24 '16 at 19:02
  • @NoseKnowsAll a loop iterator is private by default – Piotr Skotnicki May 24 '16 at 19:29
  • @HélderGonçalves Do I add the reduction right before the line `n += 2 ...` ? – Mutating Algorithm May 24 '16 at 20:33
  • @HélderGonçalves I tried running your code and I get the answer 0 every single time – Mutating Algorithm May 24 '16 at 20:55
  • @HélderGonçalves Yes, I tried running your latest version and I always get an answer of 0 – Mutating Algorithm May 24 '16 at 21:02
  • The task example makes no sense - you spawn a single task just to wait on it. Doing this recursively does not help at all, you will basically have one task per stack level, but all except the innermost are waiting. Just moving the `taskwait` out of the loop is of no help either, as you then again have a race condition on `n`. – Zulan May 24 '16 at 21:17
  • Here are some examples of how to use tasks recursively properly: [wikibooks on OpenMP Tasks](https://en.wikibooks.org/wiki/OpenMP/Tasks) [custom reduction with tasks (and other approaches)](http://stackoverflow.com/a/35805567/620382) [recursive knights tour with tasks, including threshold](https://gist.github.com/jmoy/2151e6d7070a6ce18aa9298fbe050062) ([from this question](http://stackoverflow.com/q/36808397/620382)) – Zulan May 24 '16 at 21:22
  • I corrected the example so it does not wait for each task. It is working. But you'll not get a speedup from this, because the overhead of creating a task is bigger than the work of a single task. – Hélder Gonçalves May 24 '16 at 22:22
  • @HélderGonçalves Your updated answer is still giving me an answer of 0 – Mutating Algorithm May 25 '16 at 11:00