Efficient implementation of the N fibonacci numbers using complex algorithm

Question

I am looking for an even faster method of calculating the first N fibonacci numbers. I am already aware of the recursive method of solving this problem using memoization as well as the simpler method of iteratively going from 1 - N. However, is there a mathematical way I can solve this problem; or another algorithm that can even make this process slightly shorter?

My current implementation using memoization and recursion is quite fast, but it does not meet the requirements I desire.

#include<bits/stdc++.h>
using namespace std;
 
int fib(int n) {
    int a = 0, b = 1, c, i;
    if( n == 0)
        return a;
    for(i = 2; i <= n; i++)
    {
       c = a + b;
       a = b;
       b = c;
    }
    return b;
}


int main()
{
    int n = 9;
    cout << fib(n);
}

Don't do this: `#include` -- This is a non-standard header, and it brings in an umpteen number of unnecessary header files. All your program needs is `#include ` — PaulMcKenzie, May 26 '23 at 23:39
What are the requirements you desire? Are you aware of the Q-Matrix [method](https://mathworld.wolfram.com/FibonacciQ-Matrix.html)? — Bob__, May 27 '23 at 00:00
There's a matrix formulation to represent the Fibonacci calculation, and Fib(n) is the n^th power of the matrix. This can be calculated with O(log(n)) work using fast exponentiation via squaring the matrix and halving the exponent. However, unless you use big-integer libraries to go for values bigger than Fib(46), this would be totally dominated by table lookup. Also, I don't have a C++ implementation. [Here it is in Ruby](https://stackoverflow.com/a/24439070/2166798) if you're interested, it would be easy to port other than the arbitrary size integer arithmetic that Ruby or Python provide. — pjs, May 27 '23 at 00:07
Ask the question on a math forum. Fibonacci is a special case of **linear homogeneous recursive sequence with constant coefficients.** There are **analytic closed form solutions** for such problems. `A=(1+√5)/2, B=1-A, f[n]*√5=A↑n+B↑n` (where↑ is power). However the round-off error for large `n` should be taken into consideration. — Red.Wave, May 27 '23 at 18:45

selbie · Answer 1 · 2023-05-27T04:31:37.707

Don't deny the power of a table lookup.

Usually absolute math is faster. However, some of the mathematical solutions presented and discussed on this page rely on floating point math and the pow function. Both can be imprecise.

Fib(46) starts to approach the limit for a 32-bit signed int. Fib(92) is the limit for a signed 64-bit integer. Hence, simply hardcoding the answers is a reasonable thing to do.

I realize this isn't the answer you wanted from a computational perspective. But if you really needed to compute fib(n) in a real world setting with reasonable values that don't cause overflow or have floating point deviations on different platforms, it's hard to argue with this.

int fib32(int n) {
    const static int table[] = {
    0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,10946,17711,28657,46368,75025,121393,196418,317811,514229,832040,
    1346269,2178309,3524578,5702887,9227465,14930352,24157817,39088169,63245986,102334155,165580141,267914296,433494437,701408733,1134903170,
    1836311903 };

    if (n < 0 || n >= sizeof(table) / sizeof(table[0])) {
        return -1;
    }

    return table[n];
}

long long fib64(long long n) {
    const static long long table[] = {0LL, 1LL, 1LL, 2LL, 3LL, 5LL, 8LL, 13LL, 21LL, 34LL, 55LL, 89LL, 144LL, 233LL, 377LL, 610LL, 987LL, 1597LL, 2584LL, 4181LL, 6765LL, 10946LL, 17711LL, 28657LL, 46368LL, 75025LL, 121393LL, 196418LL, 317811LL, 514229LL, 832040LL, 1346269LL, 2178309LL,
        3524578LL, 5702887LL, 9227465LL, 14930352LL, 24157817LL, 39088169LL, 63245986LL, 102334155LL, 165580141LL, 267914296LL, 433494437LL, 701408733LL, 1134903170LL, 1836311903LL, 2971215073LL, 4807526976LL, 7778742049LL, 12586269025LL, 20365011074LL, 32951280099LL, 53316291173LL, 86267571272LL, 139583862445LL, 225851433717LL, 365435296162LL, 591286729879LL, 956722026041LL, 1548008755920LL, 2504730781961LL,
        4052739537881LL, 6557470319842LL, 10610209857723LL, 17167680177565LL, 27777890035288LL, 44945570212853LL, 72723460248141LL, 117669030460994LL, 190392490709135LL, 308061521170129LL, 498454011879264LL, 806515533049393LL, 1304969544928657LL, 2111485077978050LL, 3416454622906707LL, 5527939700884757LL, 8944394323791464LL, 14472334024676221LL, 23416728348467685LL, 37889062373143906LL, 61305790721611591LL, 99194853094755497LL, 160500643816367088LL,
        259695496911122585LL, 420196140727489673LL, 679891637638612258LL, 1100087778366101931LL, 1779979416004714189LL, 2880067194370816120LL, 4660046610375530309LL, 7540113804746346429LL};

    if (n < 0 || n >= sizeof(table)/sizeof(table[0])) {
        return -1;
    }

    return table[n];
}

PaulMcKenzie · Answer 2 · 2023-05-27T01:09:17.833

Use Binet's formula.

Here is a rudimentary implementation in C++:

#include <cmath>
#include <iostream>

const double onesq5 = 0.44721359549995793928183473374626; // 1/sqrt(5)
const double phi1 = 1.6180339887498948482045868343656; // (1 + sqrt(5)) / 2
const double phi2 = -0.61803398874989484820458683436564; // (1 - sqrt(5)) / 2

uint64_t fib(int n)
{
    return onesq5 * (pow(phi1, n) - pow(phi2, n));
}

int main()
{
    for (int i = 1; i < 50; ++i)
       std::cout << fib(i) << "\n";
}

Live Example

There is no iteration or recursion to get the nth fibonacci number -- it is computed directly.
Using memoization on Binet's formula would be ok, since you wouldn't be calling pow multiple times with the same value for fib(n).

Jerry Coffin · Answer 3 · 2023-05-27T18:16:11.860

Just for fun, I did a quick speed comparison with the following code:

#include <cmath>
#include <iostream>
#include <chrono>
#include <locale>

unsigned long long fib(unsigned input) {
    unsigned long long old1 = 0;
    unsigned long long old2 = 1;

    for (int i = 1; i < input; i++) {
        unsigned long long sum = old1 + old2;
        old1 = old2;
        old2 = sum;
    }
    return old2;
}

unsigned long long fib64(long long n)
{
    const static unsigned long long table[] = { 0ULL, 1ULL, 1ULL, 2ULL, 3ULL, 5ULL, 8ULL, 13ULL, 21ULL, 34ULL, 55ULL, 89ULL, 144ULL, 233ULL, 377ULL, 610ULL, 987ULL, 1597ULL, 2584ULL, 4181ULL, 6765ULL, 10946ULL, 17711ULL, 28657ULL, 46368ULL, 75025ULL, 121393ULL, 196418ULL, 317811ULL, 514229ULL, 832040ULL, 1346269ULL, 2178309ULL,
        3524578ULL, 5702887ULL, 9227465ULL, 14930352ULL, 24157817ULL, 39088169ULL, 63245986ULL, 102334155ULL, 165580141ULL, 267914296ULL, 433494437ULL, 701408733ULL, 1134903170ULL, 1836311903ULL, 2971215073ULL, 4807526976ULL, 7778742049ULL, 12586269025ULL, 20365011074ULL, 32951280099ULL, 53316291173ULL, 86267571272ULL, 139583862445ULL, 225851433717ULL, 365435296162ULL, 591286729879ULL, 956722026041ULL, 1548008755920ULL, 2504730781961ULL,
        4052739537881ULL, 6557470319842ULL, 10610209857723ULL, 17167680177565ULL, 27777890035288ULL, 44945570212853ULL, 72723460248141ULL, 117669030460994ULL, 190392490709135ULL, 308061521170129ULL, 498454011879264ULL, 806515533049393ULL, 1304969544928657ULL, 2111485077978050ULL, 3416454622906707ULL, 5527939700884757ULL, 8944394323791464ULL, 14472334024676221ULL, 23416728348467685ULL, 37889062373143906ULL, 61305790721611591ULL, 99194853094755497ULL, 160500643816367088ULL,
        259695496911122585ULL, 420196140727489673ULL, 679891637638612258ULL, 1100087778366101931ULL, 1779979416004714189ULL, 2880067194370816120ULL, 4660046610375530309ULL, 7540113804746346429ULL, 12200160415121876738ULL };

    if (n < 0 || n >= sizeof(table) / sizeof(table[0])) {
        return -1;
    }

    return table[n];
}

const double onesq5 = 0.44721359549995793928183473374626; // 1/sqrt(5)
const double phi1 = 1.6180339887498948482045868343656; // (1 + sqrt(5)) / 2
const double phi2 = -0.61803398874989484820458683436564; // (1 - sqrt(5)) / 2

uint64_t fib_binet(int n)
{
    return onesq5 * (pow(phi1, n) - pow(phi2, n));
}

int main() {
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    auto res = fib(93);
    auto stop = high_resolution_clock::now();

    auto start2 = high_resolution_clock::now();
    auto res2 = fib64(92);
    auto stop2 = high_resolution_clock::now();

    auto start3 = high_resolution_clock::now();
    auto res3 = fib_binet(92);
    auto stop3 = high_resolution_clock::now();

    std::cout.imbue(std::locale(""));
    std::cout << res << "\t"<< res2 << "\t" << res3 << "\n";
    std::cout << "iteration time: " << duration_cast<nanoseconds>(stop - start) << "\n";
    std::cout << "Lookup time: " << duration_cast<nanoseconds>(stop2 - start2) << "\n";
    std::cout << "Binet time: " << duration_cast<nanoseconds>(stop3 - start3) << "\n";
}

The results I got are as follows:

12,200,160,415,121,876,738      12,200,160,415,121,876,738      12,200,160,415,121,913,856
iteration time: 0ns
Lookup time: 0ns
Binet time: 10,691ns

Observations:

although it's theoretically exact, when implemented on a computer using floating point arithmetic, Binet's formula can produce imprecise results.
Although it's clear it'll win when the numbers get large enough, for a result that'll fit in a 64-bit integer, Binet's formula is comparatively slow.
The implementation I'm using seems to have a minimum measurement time of 427 ns, which isn't sufficient to measure the other two meaningfully.

Speculating a bit, choosing between iteration and table lookup may not be trivial. I'd guess a great deal here will depend on call pattern. The table lookup is obviously O(1), but the constants involved potentially vary quite widely. If you're retrieving data from the cache, it'll be quite fast, but if you have to retrieve any of the data from main memory, that'll be considerably slower. If you call it repeatedly in a tight look to get all fib(1) through fib(93), the access pattern will be quite predictable, so on a typical CPU all but the first cache line will be prefetched into the cache, so the total time will be 1 main memory read + 92 cache reads.

A main memory read on a recent CPU takes roughly 42 clock cycles + 51 ns. Subsequent cache reads are likely coming from L1 cache, taking about 4 clocks apiece. So, for this pattern of being called in a tight loop, we get a total of about 92*4 clocks + 51ns. At (say) 4 GHz, that's roughly 92+51ns or 143ns total.

But if we interleave calls to this with a lot of other code, so essentially all of the reads come from main memory instead of cache, then computing all of them ends up as something like 93 * (42 clocks + 51ns). In this case, on our hypothetical 4 GHz processor, it ends up around 5,700 ns.

By contrast, the iterative algorithm is likely to need about one clock per iteration (one add, two moves that are probably handled as register renames in the ROB). To compute fib(1) through fib(93), is an average of 46.5 iterations, for a total of about 46.5 * 93 = 4325 clocks. At 4 GHz, that's about 1,100 ns.

So, to compute them all, the iterative solution is anywhere from about 10 times slower to about 5 times faster.

Side note: depending on the compiler you use, it may or may not be able to directly print out the duration produced for the final time of each algorithm. In that case change: duration_cast<nanoseconds>(stop - start) to something like duration_cast<nanoseconds>(stop - start).count() << "ns".

Reference

Here's a page of results from tests on memory access speed (with the caveat that there's obviously variation depending on both the processor and memory you use).

https://www.7-cpu.com/cpu/Skylake.html

@KellyBundy Well, that's the code from PaulMcKenzie's [answer](https://stackoverflow.com/a/76344727/4944425), they just missed an `#include `. The point that Binet's formula is not exact (from fib(72)) and expansive in the 64bit range still [stands](https://godbolt.org/z/o8e6xPcjT). — Bob__, May 27 '23 at 12:54
@KellyBundy The standard: https://en.cppreference.com/w/cpp/chrono/duration/operator_ltlt . It [needs](https://godbolt.org/z/M1fbhPfa7) C++20. — Bob__, May 27 '23 at 13:29
@Bob__ Thanks. I don't use godbolt, I think it's one of the sites that don't work well on my device. Anyway, isn't 10,691ns suspiciously enormously slow? Even Python only takes [~450 ns](https://ato.pxeger.com/run?1=hZFNTsMwEIXF1qcYiUXs0LTxX2KDepYqEQ7xIo7rOEKcgyWbbuBQnAYnTYVY4d2T_L2Z9-bjy7_FfnSXy-ccu0J93713YRwg2sHYCHbwY4gQjDdNRGgycfZwhCzL0OjMdJZJlHshaka51FJorWWtuWaKKi5qzmtRseoJ7oEepnOIWBLke0sTRvcVVSXnWqlaaKWFEoqVQqoqkbySK4UpPMAGEjgAW2iW6KL8DxcbX_zl0RpvaGJ_C-fHV4SeTQedbU-tdSZiRx4RpBdS3uBgi5oDTn_xsv8OHEnOm2SLJGhpBXVjgBNYB6FxLwbzzckH6yIerMPXLnH2O00zku1g7TYZzUNrwpGWeb6snAM1glxvs53odqof). — Kelly Bundy, May 27 '23 at 13:33
@KellyBundy: Ooops--sorry, when I copied the code, I accidentally missed the first line. I've corrected it. — Jerry Coffin, May 27 '23 at 18:17
Thanks. And I see you added discussion about the other two. What about the Binet one? Why is it so suspiciously enormously slow? — Kelly Bundy, May 27 '23 at 18:23
In @Bob__'s test (the "stands" link), it's only about 10x slower than the other two. — Kelly Bundy, May 27 '23 at 18:28
@KellyBundy If I may, profiling is tricky. See e.g. https://godbolt.org/z/Pfr5Kf57f starting from line 75, g++ calls the clock function interleaved with the actual fibonacci calculations (at least the loop and the call to `fib_net`), while clang (from line 88) just clocks a bunch of `mov`. — Bob__, May 27 '23 at 19:38
I'd have to put some work into analyzing the Binet code to figure out why it's so slow, but it's pretty easy for floating point math to hide a number of things that are more complex than they appear. Even a seemingly-simple conversion from floating point to integer is often quite a bit more expensive than often seems reasonable. Ultimately, I just don't care enough to bother. In a whole lot of cases, either its inaccuracy is going to rule it out completely, or else the speed of iteration for a big number is going to rule it out completely. They rarely compete directly. — Jerry Coffin, May 27 '23 at 20:21

score 1 · Answer 4 · answered May 27 '23 at 07:03

None of the examples yet use the full power of C++ to create the table at compile time like this (https://godbolt.org/z/nMeqj6xvq):

#include <array>
#include <cstdint>
#include <iostream>

//construct the table at compile time

template<std::size_t N>
static constexpr auto create_fibonacci_table()
{
    std::array<std::uint64_t, N> table{ 0,1 };

    for (std::size_t n = 2; n < table.size(); ++n)
    {
        table[n] = table[n - 1ul] + table[n - 2ul];
    }

    return table;
}

int main()
{
    static constexpr auto fibonacci_table = create_fibonacci_table<12>();
    std::cout << fibonacci_table[9ul];
    return 0;
}

Efficient implementation of the N fibonacci numbers using complex algorithm

4 Answers4

Reference