Parallelism isn't free. However innocent a simple pragma such as #pragma omp task looks, it comes at a significant cost, because it hides the entire logic of creating and synchronizing threads, assigning and queueing tasks, and so on. You will only observe a positive (>1) speed-up if you find a balance between the intensity of the computation, the overhead of multithreading itself, and the size of the problem (not to mention side effects of multithreading such as false sharing).
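To make that overhead concrete, here is a minimal sketch (a hypothetical micro-benchmark, not part of the original code) that wraps a trivial addition in one task per iteration; exact timings depend on the runtime, but the per-task scheduling cost typically dwarfs the single addition each task performs, so this usually runs slower than the plain serial loop:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    long sum = 0;

    double t0 = omp_get_wtime();
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; ++i)
    {
        #pragma omp task    /* one task per addition: scheduling cost >> useful work */
        {
            #pragma omp atomic
            sum += i;
        }
    }
    double t1 = omp_get_wtime();    /* implicit barrier above guarantees all tasks finished */

    printf("sum = %ld, time with one task per iteration: %f s\n", sum, t1 - t0);
    return 0;
}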
Also, keep in mind that the number of threads is always limited. Once you have created enough work for every thread, don't try to boost your code with additional work-sharing constructs - a thread cannot magically split into two separate instruction streams. That is, if your top-most loop is already parallel and has enough iterations to keep all available threads busy, you gain nothing by trying to extract nested parallelism.
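You can see this directly: with nested parallelism at its default setting (disabled in practically every implementation), an inner parallel region encountered by a thread that is already inside a parallel region executes with a team of one thread, so the inner pragma only adds overhead. A minimal sketch (hypothetical, not from the original code):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp parallel    /* nested region: with nesting disabled it gets a team of 1 */
        {
            printf("outer thread %d sees an inner team of %d thread(s)\n",
                   omp_get_ancestor_thread_num(1), omp_get_num_threads());
        }
    }
    return 0;
}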
Having said that, unless you can apply some other technique, such as memoizing partial results or removing the recursion altogether, just use a single top-most parallel loop with a reduction clause to ensure thread-safe accumulation into the shared variable:
#pragma omp parallel for reduction(+:n)   // each thread sums into a private copy of n, combined at the end
for (int i = 1; i <= r; ++i)
{
    n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
}
and then a plain sequential function:
int ncip(int dim, double R)    // needs <math.h> for floor() and sqrt()
{
    int n, r = (int)floor(R);
    if (dim == 1)
    {
        return 1 + 2*r;    // base case: integer points in [-R, R]
    }
    n = ncip(dim-1, R);    // slice through the origin
    for (int i = 1; i <= r; ++i)
    {
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));    // paired slices at +i and -i
    }
    return n;
}
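For completeness, here is a sketch of how the two pieces could be assembled into a compilable program (build with -fopenmp; the wrapper name ncip_parallel, the dimension, and the radius in main are my own illustrative choices, not part of the original code):

#include <math.h>
#include <stdio.h>

int ncip(int dim, double R);    /* the plain sequential function from above */

/* Parallel front-end: only the outermost loop is parallelized,
   the recursive calls below it stay sequential. */
int ncip_parallel(int dim, double R)
{
    int n, r = (int)floor(R);
    if (dim == 1)
        return 1 + 2*r;

    n = ncip(dim-1, R);    /* reduction adds the threads' partial sums to this initial value */
    #pragma omp parallel for reduction(+:n)
    for (int i = 1; i <= r; ++i)
    {
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
    }
    return n;
}

int main(void)
{
    /* e.g. lattice points inside a 4-dimensional ball of radius 10 */
    printf("%d\n", ncip_parallel(4, 10.0));
    return 0;
}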
DEMO