1

I'm trying to understand why the following runs much faster on 1 thread than on 4 threads with OpenMP. The code is based on a similar question: OpenMP recursive tasks, but when I try to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (though I'm not sure what). Do people get better speed when running the code below on 4 threads than on 1 thread? I'm getting a 10 times slowdown when running on 4 cores (I should be getting a moderate speedup rather than a significant slowdown).

#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

int fib(int n)
{
    if (n == 0 || n == 1)
        return n;
    if (n < 20) // EDITED CODE TO INCLUDE CUTOFF
        return fib(n-1) + fib(n-2);
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main(){
  omp_set_nested(1);
  omp_set_num_threads(4);
  double start_time = omp_get_wtime();
  #pragma omp parallel
  {
    #pragma omp single
    {
      cout << fib(25) << endl;
    }
  }
  double time = omp_get_wtime() - start_time;
  std::cout << "Time(ms): " << time*1000 << std::endl;
  return 0;
}
user308485
  • Every number in the fibonacci sequence depends on the previous numbers. Why would this lend itself to multithreading? You can't expect any performance gain. – Sid S May 15 '18 at 04:16
  • One thing you *could* do to speed it up is to cache fibonacci values already calculated. That would give you a significant speedup (see the memoization sketch after this comment thread). – Sid S May 15 '18 at 04:26
  • Why `omp_set_nested(1)`? I don't see any nested parallel sections. – Daniel Langr May 15 '18 at 05:35
  • Please provide 1) specific performance results 2) information about your system 3) a [mcve] 4) your compiler, version, and command – Zulan May 15 '18 at 05:36
  • I did some experiments on 2x12 core Haswell E5 / GCC, and the speedup strongly depended on the cutoff parameter. For cutoff 20, I got speedup around 5, for cutoff 42 around 10 with 24 threads. I calculated `fib(48)`, lower inputs were too quick to get relevant results. – Daniel Langr May 15 '18 at 05:55
  • I have just noticed your edit trying a cutoff. You did really well, yet you did not seem to understand its use. Please first use my answer and see how it works (it is clear to understand), then get back to your edited(!) code and do it again with a cutoff. Then you may even write your own answer and accept it :)) – Yılmaz Durmaz May 18 '18 at 02:10
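
A minimal sketch of the caching idea Sid S suggests above (serial memoization; the function name fib_memo and the cache are illustrative, not part of the original question):

#include <iostream>
#include <unordered_map>

// Serial Fibonacci that caches already-computed values (memoization).
long long fib_memo(int n, std::unordered_map<int, long long>& cache)
{
    if (n < 2)
        return n;
    auto it = cache.find(n);
    if (it != cache.end())
        return it->second;   // reuse a previously computed value
    long long result = fib_memo(n - 1, cache) + fib_memo(n - 2, cache);
    cache[n] = result;
    return result;
}

int main()
{
    std::unordered_map<int, long long> cache;
    std::cout << fib_memo(45, cache) << std::endl;   // 1134903170
    return 0;
}

With the cache, each value is computed only once, so the whole run is linear in n; it sidesteps the tasking-overhead question entirely, which is why it is only a side note here.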

2 Answers

2

Have you tried it with a large number?

In multi-threading, it takes some time to initialize work on the CPU cores. For smaller jobs, which finish very quickly on a single core, this overhead means threading slows the job down.

Multi-threading shows a speedup when the job normally takes longer than a second, not milliseconds.

There is also another bottleneck for threading. If your code tries to create too many threads, usually through recursive methods, this can delay all running threads and cause a massive setback.

This is mentioned in the OpenMP/Tasks wiki page, where a manual cutoff is suggested. You need two versions of the function: when the recursion goes too deep, it continues with the single-threaded version.

EDIT: the cutoff counter needs to be incremented before entering the OMP region.


The following code is for the OP to test:

#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

#define CUTOFF 5

// Serial version, used once the recursion is more than CUTOFF levels deep.
int fib_s(int n)
{
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    a = fib_s(n - 1);
    b = fib_s(n - 2);
    res = a + b;
    return res;
}
// Task-parallel version, used for the first CUTOFF levels of recursion.
int fib_m(int n, int co)
{
    if (co >= CUTOFF) return fib_s(n);
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    co++;
#pragma omp task shared(a)
    a = fib_m(n - 1,co);
#pragma omp task shared(b)
    b = fib_m(n - 2,co);
#pragma omp taskwait
    res = a + b;
    return res;
}

int main()
{
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
#pragma omp parallel
    {
#pragma omp single
        {
            cout << fib_m(25,1) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}

RESULT: With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.

co=1   14.5s
co=2    9.5s
co=3    6.4s
co=10   7.5s
co=15   7.0s 
co=20   8.5s
co=21 >18.0s
co=22 >40.0s
Yılmaz Durmaz
  • Even for large numbers (> 35), the fibonacci is much faster on one thread than on several. – user308485 May 15 '18 at 01:11
  • @user308485, even 35 is not as big a number as you think. I have tested this nested method and the timing only becomes relevant after 40. Try it again in that range. – Yılmaz Durmaz May 15 '18 at 01:25
  • Still no gain using more cores. Have you tested this code just now? Are you able to get some speedup with more cores at some particular size? – user308485 May 15 '18 at 01:34
  • look at the `fib()` subroutine. it performs one addition (e.g. compute) and "spawns" two threads (e.g. overhead), so imho, it is no surprise there is no benefit in using `OpenMP` here (a single `for` loop would even be faster than a recursive implementation too) – Gilles Gouaillardet May 15 '18 at 01:43
  • @user308485, I have an edit to solve your problem, but kept the previous parts for possible future reference; you need to add a cut-off to your function. Don't forget, you need two versions of your function. Oh, by the way, your code does not do parallel looping but makes a tree of threads; that is why it takes so much time. – Yılmaz Durmaz May 15 '18 at 01:53
  • @GillesGouaillardet, I saw the problem and corrected my answer. by the way, "for loop" will not benefit here because of the recursion. – Yılmaz Durmaz May 15 '18 at 01:54
  • @YılmazDurmaz By cutoff, do you mean if n is too small, then just run a serial algorithm? I have tested that as well, using cutoffs like n = 20. The serial version is still faster (although both are faster overall). Also, I'm quite certain that this code doesn't create a tree of threads, since only 4 threads are present at most (the threads are created in the pragma omp parallel) . It does create a tree of tasks (which is the purpose of recursion), but the tasks are fairly large when the cutoff is added so the overhead shouldn't be a concern then. I changed my code to add the cutoff. – user308485 May 15 '18 at 01:56
  • @user308485, I have just added a code sample to the answer. I don't have a compiler to test. It might include a few typos; just correct them if any. Try different CUTOFF values and write down the results to compare. CUTOFF=1 should be the same as a single-core run. If it works as you expect, please select it as the answer to your question. If not, I cannot help anymore as I have other things to do. AND every recursion is a tree, so every call of fib creates a branch of the tree. – Yılmaz Durmaz May 15 '18 at 02:08
  • @YılmazDurmaz my point is a recursive implementation (regardless how many threads are used) is highly inefficient and cannot beat a straightforward `for` loop. `int fib(int n) { int i; if (n < 2) return n; int a[2] = {0, 1}; for (i=0; i` [snippet cut off by the comment length limit; a completed sketch appears after this comment thread] – Gilles Gouaillardet May 15 '18 at 02:26
  • @user308485, I just saw a missing part in the code :)) The main method should call `fib_m(25,1)`. CUTOFF=1 should be the same as single core, 2 twice as fast and 3 three times as fast, but at some point the time should increase because of the excessive number of threads. – Yılmaz Durmaz May 15 '18 at 02:57
  • @user308485, today I installed CYGWIN and the gcc compiler just for you. I found out that I had placed CUTOFF in the wrong place: it should be incremented before the OMP region. The code now calculates fib-45 in under 8 seconds with a cutoff value of 10. You can print out timings for different values, only to find this is the only way to increase speed in nested methods. – Yılmaz Durmaz May 17 '18 at 03:15
  • @user308485, thanks for accepting this answer. as the last word, do you remember I asked you to find the difference between yours and mine? your edited code dives into threads only until the LAST 20 levels, mine dives into the FIRST 20 levels, if set for 20. in your code you need to know how deep it might go, in mine you just set how deep it should go. that's it. farewell and happy coding. – Yılmaz Durmaz May 20 '18 at 07:58
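
For completeness, here is a completed sketch of the iterative loop Gilles Gouaillardet's truncated comment above refers to (this reconstruction is mine and only illustrates the idea, not his exact snippet):

#include <iostream>

// Iterative Fibonacci: one addition per step, no recursion and no tasking overhead.
long long fib_iter(int n)
{
    if (n < 2)
        return n;
    long long a = 0, b = 1;   // fib(0), fib(1)
    for (int i = 2; i <= n; ++i) {
        long long next = a + b;
        a = b;
        b = next;
    }
    return b;
}

int main()
{
    std::cout << fib_iter(45) << std::endl;   // 1134903170
    return 0;
}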
0

I do not know how to tell the compiler not to create parallel tasks after a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (and also has no effect).

So I have to manually specify the level after which no more tasks should be created, which IMHO is sad. I still believe there should be a way to do this (if somebody knows, kindly let me know); see the sketch after the code below for one possible approach. Here is how I attempted it; for input sizes above 20, the parallel version runs a bit faster than the serial one (around 70-80% of the time). Ref: the code is based on an assignment from this course (the solution was not provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x

#include <stdio.h>
#include <omp.h>
#include <math.h>

int fib(int n, int rec_height)
{
  int x = 1, y = 1;
  if (n < 2) 
      return n;
  int tCount = 0;

  if (rec_height > 0)   // Surprisingly, without this check the parallel code is slower than the serial one (I believe it is not needed, I just don't know how to use OpenMP)
  {
    rec_height -= 1;
    #pragma omp task shared(x)
    x = fib(n - 1, rec_height);
    #pragma omp task shared(y)
    y = fib(n - 2, rec_height);
    #pragma omp taskwait
  }
  else {
    x = fib(n - 1, rec_height);
    y = fib(n - 2, rec_height);
  }
  return x+y;

}


int main()
{
  int tot_thread = 16;
  int recDepth = (int)log2f(tot_thread);
  if( ((int)pow(2, recDepth)) < tot_thread) recDepth += 1;
  printf("\nrecDepth: %d\n",recDepth);
  omp_set_max_active_levels(recDepth);
  omp_set_nested(recDepth-1);

  int n,fibonacci;
  double starttime;
  printf("\nPlease insert n, to calculate fib(n): %d\n",n);
  scanf("%d",&n);
  omp_set_num_threads(tot_thread);
  starttime=omp_get_wtime();
  #pragma omp parallel
  {
   #pragma omp single
   {
    fibonacci=fib(n, recDepth);
   }
  }
  printf("\n\nfib(%d)=%d \n",n,fibonacci);
  printf("calculation took %lf sec\n",omp_get_wtime()-starttime);
  return 0;
}
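
As an aside on the "is there a way to do this" question above: OpenMP 3.1 added the final clause on the task construct, which makes all descendant tasks undeferred once the expression is true, so the rest of the recursion runs serially in the creating thread. A minimal sketch under that assumption (the depth threshold of 4 is only an illustrative choice, and this is not the course's intended solution):

#include <iostream>
#include <omp.h>

const int FINAL_DEPTH = 4;   // illustrative threshold, tune for your machine

// Recursive Fibonacci; once depth reaches FINAL_DEPTH, the final() clause
// stops deferring tasks and the remaining calls run serially.
long long fib(int n, int depth)
{
    if (n < 2)
        return n;
    long long x, y;
    #pragma omp task shared(x) final(depth >= FINAL_DEPTH)
    x = fib(n - 1, depth + 1);
    #pragma omp task shared(y) final(depth >= FINAL_DEPTH)
    y = fib(n - 2, depth + 1);
    #pragma omp taskwait
    return x + y;
}

int main()
{
    omp_set_num_threads(4);
    double start = omp_get_wtime();
    long long result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib(45, 0);
    }
    std::cout << "fib(45) = " << result << "\n";
    std::cout << "Time(s): " << omp_get_wtime() - start << "\n";
    return 0;
}

Note that a final task is still created (it just runs immediately in the creating thread), so the per-call overhead does not disappear completely; a manual cutoff that skips the pragmas entirely, as in the answers above, can still be faster.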
user1953366