Just for fun, I did a quick speed comparison with the following code:
#include <cmath>
#include <cstdint>   // for uint64_t
#include <iostream>
#include <chrono>
#include <locale>
unsigned long long fib(unsigned input) {
    unsigned long long old1 = 0;
    unsigned long long old2 = 1;
    for (unsigned i = 1; i < input; i++) {
        unsigned long long sum = old1 + old2;
        old1 = old2;
        old2 = sum;
    }
    return old2;
}
unsigned long long fib64(long long n)
{
    static const unsigned long long table[] = { 0ULL, 1ULL, 1ULL, 2ULL, 3ULL, 5ULL, 8ULL, 13ULL, 21ULL, 34ULL, 55ULL, 89ULL, 144ULL, 233ULL, 377ULL, 610ULL, 987ULL, 1597ULL, 2584ULL, 4181ULL, 6765ULL, 10946ULL, 17711ULL, 28657ULL, 46368ULL, 75025ULL, 121393ULL, 196418ULL, 317811ULL, 514229ULL, 832040ULL, 1346269ULL, 2178309ULL,
        3524578ULL, 5702887ULL, 9227465ULL, 14930352ULL, 24157817ULL, 39088169ULL, 63245986ULL, 102334155ULL, 165580141ULL, 267914296ULL, 433494437ULL, 701408733ULL, 1134903170ULL, 1836311903ULL, 2971215073ULL, 4807526976ULL, 7778742049ULL, 12586269025ULL, 20365011074ULL, 32951280099ULL, 53316291173ULL, 86267571272ULL, 139583862445ULL, 225851433717ULL, 365435296162ULL, 591286729879ULL, 956722026041ULL, 1548008755920ULL, 2504730781961ULL,
        4052739537881ULL, 6557470319842ULL, 10610209857723ULL, 17167680177565ULL, 27777890035288ULL, 44945570212853ULL, 72723460248141ULL, 117669030460994ULL, 190392490709135ULL, 308061521170129ULL, 498454011879264ULL, 806515533049393ULL, 1304969544928657ULL, 2111485077978050ULL, 3416454622906707ULL, 5527939700884757ULL, 8944394323791464ULL, 14472334024676221ULL, 23416728348467685ULL, 37889062373143906ULL, 61305790721611591ULL, 99194853094755497ULL, 160500643816367088ULL,
        259695496911122585ULL, 420196140727489673ULL, 679891637638612258ULL, 1100087778366101931ULL, 1779979416004714189ULL, 2880067194370816120ULL, 4660046610375530309ULL, 7540113804746346429ULL, 12200160415121876738ULL };
    // Out-of-range input returns -1, which wraps around to ULLONG_MAX as an error sentinel.
    if (n < 0 || n >= static_cast<long long>(sizeof(table) / sizeof(table[0]))) {
        return -1;
    }
    return table[n];
}
const double onesq5 = 0.44721359549995793928183473374626; // 1/sqrt(5)
const double phi1 = 1.6180339887498948482045868343656; // (1 + sqrt(5)) / 2
const double phi2 = -0.61803398874989484820458683436564; // (1 - sqrt(5)) / 2
uint64_t fib_binet(int n)
{
    // Note: the implicit conversion to uint64_t truncates the double result toward zero.
    return onesq5 * (pow(phi1, n) - pow(phi2, n));
}
int main() {
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    auto res = fib(92);   // 92 is the largest index whose value fits in 64 bits
    auto stop = high_resolution_clock::now();
    auto start2 = high_resolution_clock::now();
    auto res2 = fib64(92);
    auto stop2 = high_resolution_clock::now();
    auto start3 = high_resolution_clock::now();
    auto res3 = fib_binet(92);
    auto stop3 = high_resolution_clock::now();
    std::cout.imbue(std::locale(""));
    std::cout << res << "\t" << res2 << "\t" << res3 << "\n";
    std::cout << "iteration time: " << duration_cast<nanoseconds>(stop - start) << "\n";
    std::cout << "Lookup time: " << duration_cast<nanoseconds>(stop2 - start2) << "\n";
    std::cout << "Binet time: " << duration_cast<nanoseconds>(stop3 - start3) << "\n";
}
The results I got are as follows:
12,200,160,415,121,876,738 12,200,160,415,121,876,738 12,200,160,415,121,913,856
iteration time: 0ns
Lookup time: 0ns
Binet time: 10,691ns
Observations:
- Although it's theoretically exact, Binet's formula can produce imprecise results when implemented on a computer using floating-point arithmetic.
- Although it's clear Binet's formula will win when the numbers get large enough, it's comparatively slow for results that fit in a 64-bit integer.
- The timer implementation I'm using seems to have a granularity of about 427 ns, which isn't fine enough to measure the other two meaningfully (both read as 0 ns).
Speculating a bit, choosing between iteration and table lookup may not be trivial. I'd guess a great deal here depends on the call pattern. The table lookup is obviously O(1), but the constants involved can vary quite widely. If the data is already in cache, the lookup will be quite fast, but any read that has to go to main memory will be considerably slower. If you call it repeatedly in a tight loop to get all of fib(1) through fib(93), the access pattern is quite predictable, so on a typical CPU all but the first cache line will be prefetched into the cache, and the total comes to 1 main-memory read + 92 cache reads.
A main-memory read on a recent CPU takes roughly 42 clock cycles + 51 ns. The subsequent reads are likely to come from L1 cache, at about 4 clocks apiece. So, for this pattern of being called in a tight loop, we get a total of about 92 * 4 clocks + 51 ns (the 42-cycle portion of the single miss adds only about 10 ns more). At (say) 4 GHz, that's roughly 92 ns + 51 ns, or about 143 ns total.
But if we interleave calls to this with a lot of other code, so essentially all of the reads come from main memory instead of cache, then computing all of them ends up as something like 93 * (42 clocks + 51ns). In this case, on our hypothetical 4 GHz processor, it ends up around 5,700 ns.
By contrast, the iterative algorithm is likely to need about one clock per iteration (one add, plus two moves that are probably handled as register renames in the ROB). Computing fib(1) through fib(93) averages 46.5 iterations per call, for a total of about 46.5 * 93 ≈ 4,325 clocks. At 4 GHz, that's about 1,100 ns.
So, to compute them all, the iterative solution is anywhere from roughly 8 times slower than the lookup (1,100 ns vs. 143 ns) to about 5 times faster (1,100 ns vs. 5,700 ns).
Side note: streaming a duration directly, as the code above does, requires C++20, so depending on the compiler you use it may or may not work. If it doesn't, change duration_cast<nanoseconds>(stop - start) to something like duration_cast<nanoseconds>(stop - start).count() << " ns".
Reference
Here's a page of results from tests on memory access speed (with the caveat that there's obviously variation depending on both the processor and memory you use).
https://www.7-cpu.com/cpu/Skylake.html