
I wanted to optimize recursion with OpenMP, so I started from this question: best-way-to-parallelize-this-recursion-using-openmp

While looking into optimizing recursive functions I was first interested in OpenMP, starting from the code below, which I found at https://en.cppreference.com/w/cpp/chrono:

#include <iostream>
#include <chrono>
 
long fibonacci(unsigned n)
{
    if (n < 2) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Compiled with g++ without any optimization, it gives this result:

f(42) = 267914296
elapsed time: 1.88232s

Then I used OpenMP similarly to the answer linked above... but I had to stop the process because it was taking a very, very long time. I do not know much about parallelizing recursion with OpenMP, since I only use it to optimize for loops in my code. I also found something in the GCC documentation:

__attribute__((const))

Then I added it to my OpenMP "optimized" version, but all I obtained was a core dump!

So I removed my omp pragmas to check whether the core dump was due to my code or to something else.

Then the code became:

#include <iostream>
#include <chrono>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    long r[2];

    if (n < 2) return n;
    r[0] = fibonacci(n-1);
    r[1] = fibonacci(n-2);
    return (r[0] + r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

I compiled it with the following command:

g++ fibonacci.cpp -o fibonacci -Wall -O2 -march=native

Now it takes a lot less time to perform the Fibonacci computation of 42:

f(42) = 267914296
elapsed time: 0.00106504s

That was in power-save mode; in performance mode it gives:

f(42) = 267914296
elapsed time: 0.000187806s

Here is the lscpu output for my computer (if somebody wants to compare results):

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU(s) scaling MHz:  95%
    CPU max MHz:         4500,0000
    CPU min MHz:         800,0000
    BogoMIPS:            8403,00

Now, I do not know how much time it would take with OpenMP, because I was not able to obtain a satisfactory result... I hope somebody will find a way to add omp pragmas to this kind of recursive case.

As requested, here is OMP version 1 of this case:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    long a, b;

    if (n < 2) return n;
    #pragma omp parallel
    #pragma omp single nowait
    {         
       #pragma omp task shared(a)
       a=fibonacci(n-1);
       #pragma omp task shared(b)
       b=fibonacci(n-2);
       #pragma omp taskwait
    }
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

And here is the version that drove me to "silly code":

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    long r[2];
    int i;

    if (n < 2) return n;
    #pragma omp parallel
    #pragma omp single
    for (i = 0; i < 2; ++i)
    {
        #pragma omp task shared(r)
        r[i] = fibonacci(n-i-1);
    }

    return (r[0] + r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

The build command for these OMP versions is:

g++ omp_fibonacci_for.cpp -o omp_fibonacci_for -Wall -fopenmp

Both of them seem to loop "forever" (fortunately, Ctrl-C worked fine). But, while writing the end of this question, the second version, with the "for" loop, finished successfully:

f(42) = 267914296
elapsed time: 480.953s

To be clear, I'm not looking for "best perf"; my objective is to obtain an OMP version that runs in a reasonable time, with the goal of finding the right way to use OpenMP on recursive code. Honestly, I was expecting a better execution time than 480s! I was also surprised by the "poor debug" version above, which took less than 0.001s.

The thing is, I'd now like to know whether the way I use OpenMP would be acceptable if I applied this model to "heavier" tasks, or whether I am doing something wrong.

  • 1
    Perhaps Code Review would be better, since you have a working solution that you'd like to improve. – Ted Lyngmo May 18 '23 at 00:21
  • 1
    Most of your execution time is spent branching versus data statements. Even though your code can fit inside an instruction cache, the processor still needs to run its branch evaluation. Speeding up your function would require reducing the quantity of branching. – Thomas Matthews May 18 '23 at 00:29
  • 2
    My immediate reaction is that OpenMP isn't likely to help a lot here. OpenMP's primary case is a loop where each iteration of the loop can be computed independently from any other iteration. There's also overhead in distributing work to threads, so you want each iteration to involve enough work to keep that overhead minimal--but here, each iteration is basically a single addition. You can certainly use OpenMP other ways, but the further you get from its primary use-case, the less likely it is to do a lot of good. – Jerry Coffin May 18 '23 at 01:59
  • 2
    @JerryCoffin IMO, this is a bit misleading. Yes, 20 years ago OpenMP supported mostly only data (loop) parallelism. But now, OpenMP is much more than just that. Task parallelism was added into OpenMP long time ago and it is suitable for parallelization of recursive algorithms (such as quicksort). Even before tasks, there were sections, and the OpenMP quicksort implementation provided by libstdc++ parallel mode is very efficient (though being quite old). Anyway, I agree that parallelization of Fibonacci with tasks does not make sense, since the amount of work in each task is too low. – Daniel Langr May 18 '23 at 09:50
  • 1
    AFAIK, the code added in the update is incorrect. For task parallelism, you should create only a single parallel section and then create tasks within. Here, you create multiple nested parallel sections recursively. See, e.g., https://acenet-arc.github.io/ACENET_Summer_School_OpenMP/06-fib/. – Daniel Langr May 18 '23 at 13:20
  • 1
    Related question: [OpenMP parallel program of Fibonacci is slower than sequential](https://stackoverflow.com/q/58819829/580083) – Daniel Langr May 18 '23 at 13:20
  • @DanielLangr Many thanks to you; the ACENET link you've sent me is exactly what I needed. – user10475133 May 18 '23 at 13:35

4 Answers

2

If you want to compute Fibonacci numbers quickly, you usually want to start by using iteration instead of recursion.

unsigned long long fib(unsigned input) {
    if (input == 0) return 0;   // F(0) = 0; the loop below starts from F(1)

    unsigned long long old1 = 0;
    unsigned long long old2 = 1;

    for (unsigned i = 1; i < input; i++) {
        unsigned long long sum = old1 + old2;
        old1 = old2;
        old2 = sum;
    }
    return old2;
}

Note that I've changed this a bit, to use 64-bit unsigned long longs throughout, so we can test with somewhat larger numbers: 64 bits are large enough to compute F(93).

Doing a bit of testing, it appears that the shortest interval high_resolution_clock can measure (at least on my system) is around 427 nanoseconds. If I compile this with optimization turned off and compute F(93), it usually says it ran in 427ns, but sometimes 855ns. If I turn on optimization so the timing means something, it usually says 0ns, but once in a while says 427ns.

Doing a bit of computation, I'd guess it's really more like 30-35ns (roughly 1 clock per iteration, 93 iterations / 2.8 GHz). On your CPU probably closer to 20-25ns.

OpenMP/threading

I see two problems with using threading for this task.

First of all, threading works best when you have things that are independent of each other (e.g., a loop in which each iteration of a loop does the same operation, but each of them on different data). Fibonacci numbers are kind of the opposite of that--each computation depends on the result of the previous computation, so it's difficult to do them in parallel.

Second, using threads adds some overhead to create threads, pass data to them, collect data from them, and so on. For threading to work out, you need enough work to be done that each thread does quite a bit more work than the overhead to get it to do that work. Again, that's pretty nearly the opposite of this case. On a typical system, it takes a lot longer to create even one thread than it takes the code above to compute F(93).

Trying to use threading for this job is like trying to write a word on a piece of paper by having several people each write part of a letter, then pass the paper to the next person. One person could write the whole word in less time than it takes to pass the piece of paper around.

Jerry Coffin
  • Thanks for your reply. The code I used is not adequate to be optimised with OpenMP, but your reply taught me a lot of things. – user10475133 May 18 '23 at 10:43
1

You can improve your original result by orders of magnitude by starting with a good recursive algorithm (sans optimizations):

#include <iostream>
#include <chrono>

long fibonacci_recursive(unsigned n, unsigned result, unsigned next)
{
    if (n == 0)
    {
        return result;
    }

    return fibonacci_recursive(n - 1, next, result + next);
}

long fibonacci(unsigned n)
{
    return fibonacci_recursive(n, 0, 1);
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

OUTPUT

% ./a.out
f(42) = 267914296
elapsed time: 4.8508e-05s
%
cdlane
0

Yours is a silly implementation.

The call

    return fibonacci(n-1) + fibonacci(n-2);

should be a lookup into an (un)ordered set (i.e., memoize previously computed values).

Captain Giraffe
  • I totally agree with "silly implementation". The thing is that the processing time is 10000× faster in the r[0] + r[1] version... The reason is that I removed the omp pragmas... – user10475133 May 18 '23 at 00:57
  • The meaning of my question is rather: is it worth using OpenMP in this case? Or, if you prefer: is there a way to get better results with OpenMP? (I'm currently thinking about a version using SIMD with a single OpenMP thread...) – user10475133 May 18 '23 at 01:03
  • @Majukarma You're not showing how you tried using OpenMP. Given that this is the wrong algorithm, you can probably speed it up by using tasks, but limit your tasks to just the first couple of levels. – Victor Eijkhout May 18 '23 at 09:04
  • Thanks for comments. I’ll put the openMP version in a couple of hours. – user10475133 May 18 '23 at 10:35
  • Why an unordered set (map)? Wouldn't an array (vector) be better in this case? – Daniel Langr May 18 '23 at 13:17
0

Perfectible solution

So, after taking all the comments and answers into consideration, this new code seems to be a better way to apply OpenMP pragmas to recursive code:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    long a, b;

    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait
    
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel
    {
        #pragma omp single
        res = fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Build command:

g++ omp_fibonacci.cpp -o omp_fibonacci -Wall -fopenmp

Output:

f(42) = 267914296
elapsed time: 255.225s

@DanielLangr's comments were particularly useful.

Candidate solution

This is the best I obtained with the use of OpenMP.

And, with @JerryCoffin's answer, @ThomasMatthews' comments, and the advice of @VictorEijkhout... I thought about the parallelization differently, and it drove me to this code:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
    long a, b;
    if (n < 2) return n;
    a = fibonacci(n-2);
    b = fibonacci(n-1);
    return (a+b);
}

long omp_fibonacci(unsigned n)
{
    long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait
    
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel 
    {
        #pragma omp single
        res = omp_fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Output:

f(42) = 267914296
elapsed time: 1.00274s