I'm trying to understand why the following runs much faster on 1 thread than on 4 threads on OpenMP. The following code is actually based on a similar question: OpenMP recursive tasks but when trying to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (and not sure what it is). Do people get better speed when running the below on 4 threads than on 1 thread? I'm getting a 10 times slowdown when running on 4 cores (I should be getting moderate speedup rather than significant slowdown).
int fib(int n)
{
if(n == 0 || n == 1)
return n;
if (n < 20) //EDITED CODE TO INCLUDE CUTOFF
return fib(n-1)+fib(n-2);
int res, a, b;
#pragma omp task shared(a)
a = fib(n-1);
#pragma omp task shared(b)
b = fib(n-2);
#pragma omp taskwait
res = a+b;
return res;
}
int main(){
omp_set_nested(1);
omp_set_num_threads(4);
double start_time = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
cout << fib(25) << endl;
}
}
double time = omp_get_wtime() - start_time;
std::cout << "Time(ms): " << time*1000 << std::endl;
return 0;
}