
I have the following code which I am trying to parallelize using OpenMP.

int ncip(int dim, double R){
    int n, r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    #pragma omp task shared(n, dim)
    n = ncip(dim-1, R); // last coord 0

    for(int i=1; i<=r; ++i){
        #pragma omp task shared(n, dim)
        n += 2*ncip(dim-1, sqrt(R*R - i*i)); // last coord +- i
    }
    return n;
}

I need to apply task-based parallelism because of the recursive calls, but I'm not seeing any speedup in my computation. What am I doing wrong? Any suggestions to help speed up this calculation?

Mutating Algorithm
  • think about this, you have only 8 threads, and the most computationally intensive part is `sqrt(R*R - i*i)`, but on top of that, you add overhead induced by multithreading itself (creating/synchronizing threads). also, putting a variable inside the `shared` clause doesn't automatically make it safe in terms of concurrent access – Piotr Skotnicki May 25 '16 at 16:52
  • @PiotrSkotnicki Can you please show me how to do that ? – Mutating Algorithm May 25 '16 at 16:58
  • @PiotrSkotnicki I have been stuck on this for days – Mutating Algorithm May 25 '16 at 16:58
  • first off, what is `ncip`? is this some kind of a mathematical formula ? – Piotr Skotnicki May 25 '16 at 17:02
  • to add to my previous comment, if you have enough iterations at the top level to divide among the threads you have, and you don't get speed-up, there is little chance you'll achieve better results using nested parallelism. that is, you have only 8 threads, if there's enough work for each already, why would you think you can gain something more using nested parallelism ? – Piotr Skotnicki May 25 '16 at 17:06
  • @PiotrSkotnicki it's implementing a mathematical model to count the number of points inside a sphere of dimension N and radius R. – Mutating Algorithm May 25 '16 at 17:07
  • what values for `dim` and `R` do you use ? it's quite important to know – Piotr Skotnicki May 25 '16 at 17:08
  • did you try already using `parallel for` at the top-most level, so that recursive calls are done within each thread, without spawning more tasks ? – Piotr Skotnicki May 25 '16 at 17:15
  • @PiotrSkotnicki by top most you mean above the first line of code ? – Mutating Algorithm May 25 '16 at 17:18
  • write the entire function body in a parallel for block ? – Mutating Algorithm May 25 '16 at 17:20
  • @PiotrSkotnicki I understand what you are saying but how does that improve performance as opposed to using the task based approach ? – Mutating Algorithm May 25 '16 at 17:22
  • @PiotrSkotnicki your suggestion is indeed faster but why ? What's happening technically ? – Mutating Algorithm May 25 '16 at 17:25
  • 1
    http://coliru.stacked-crooked.com/a/29ae094732e2e56a , does this work ? – Piotr Skotnicki May 25 '16 at 17:26
  • 1
    @PiotrSkotnicki Yes! this works for me, can you explain why this works ? – Mutating Algorithm May 25 '16 at 17:43
  • It is not helpful to post so many questions about the same underlying problem. I counted **13 in the last two days** on your attempt to parallelize this problem. You should focus on the [X-problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Keep minor changes in the approach as discussion within the original question. If you really feel a new question is in order, refer to your original question and briefly differentiate the new question. – Zulan May 25 '16 at 19:14
  • Possible duplicate of [Parallelizing C++ code using OpenMP, calculations actually slower in parallel](http://stackoverflow.com/questions/37420289/parallelizing-c-code-using-openmp-calculations-actually-slower-in-parallel) – Zulan May 25 '16 at 19:14

1 Answer


Parallelism isn't free, and so, however innocent a simple pragma like #pragma omp task may look, it comes at a significant cost: it hides the entire logic of creating and synchronizing threads, assigning and queueing tasks, and so on. Only if you find a balance between the intensity of the computations, the expense of multithreading itself, and the size of the problem (not to mention side effects of multithreading, like false sharing) will you observe a positive (>1) speed-up.
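
To see how much of that hidden machinery a *correct* task version actually needs, here is a minimal sketch of a race-free task-based variant. This is my own scaffolding, not the original code: the name ncip_tasks and the per-task partial buffer are hypothetical, and the function must be started from inside a parallel/single region. Each task writes to its own slot, and taskwait joins the children before the results are summed:

#include <math.h>
#include <vector>

int ncip_tasks(int dim, double R)
{
    int r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    // One slot per task, so no two tasks ever touch the same memory.
    std::vector<int> partial(r + 1, 0);

    #pragma omp task shared(partial)
    partial[0] = ncip_tasks(dim-1, R); // last coord 0

    for (int i = 1; i <= r; ++i)
    {
        #pragma omp task shared(partial) firstprivate(i)
        partial[i] = 2 * ncip_tasks(dim-1, sqrt(R*R - i*i)); // last coord +-i
    }

    #pragma omp taskwait // join all child tasks before reading their results

    int n = 0;
    for (int i = 0; i <= r; ++i)
        n += partial[i];
    return n;
}

// Must be launched from within a parallel region, e.g.:
// #pragma omp parallel
// #pragma omp single
// n = ncip_tasks(dim, R);

Even with correct synchronization, this spawns far more tasks than there are threads, which is exactly the overhead described above.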

Also, keep in mind that the number of threads is always limited. Once you have created enough work for every thread, don't try to boost your code by adding further work-sharing constructs: a thread cannot magically divide itself into two separate instruction flows. That is, if your top-most loop is already parallel and has enough iterations to keep all available threads busy, you won't gain anything by trying to extract nested parallelism.
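
As a quick sanity check (my own addition, not part of the original answer), you can print how many threads the runtime will actually give you; once the top-level loop has at least that many iterations, nesting buys nothing:

#include <omp.h>
#include <stdio.h>

int main()
{
    // Typically the number of hardware threads,
    // e.g. the 8 mentioned in the comments above.
    printf("max threads: %d\n", omp_get_max_threads());
}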

Having said that, unless you can utilize some other technique, like memoizing partial results or removing the recursion altogether, just use a single top-most parallel loop and a reduction clause to ensure thread-safe accumulation into the shared variable:

#pragma omp parallel for reduction(+:n)
for (int i = 1; i <= r; ++i)
{
    n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
}

and then a plain sequential function:

#include <math.h> // for floor() and sqrt()

int ncip(int dim, double R)
{
    int n, r = (int)floor(R);

    if (dim == 1)
    {
        return 1 + 2*r; // base case: 2r+1 integer points on a line segment
    }

    n = ncip(dim-1, R); // last coord 0

    for (int i = 1; i <= r; ++i)
    {
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i))); // last coord +-i
    }

    return n;
}
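
Putting the two pieces together, a top-level wrapper could look like the sketch below (the name ncip_parallel is hypothetical). Only the outermost loop is parallelized; every recursive call beneath it runs sequentially inside the thread that owns its iteration, and reduction(+:n) folds each thread's private sum back into n after the loop:

#include <math.h>

int ncip(int dim, double R); // the sequential function above

int ncip_parallel(int dim, double R)
{
    int r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    int n = ncip(dim-1, R); // last coord 0, computed once up front

    #pragma omp parallel for reduction(+:n)
    for (int i = 1; i <= r; ++i)
    {
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
    }

    return n;
}

Compile with OpenMP enabled (e.g. -fopenmp for GCC/Clang), otherwise the pragma is silently ignored and the code runs sequentially.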

DEMO

Piotr Skotnicki