
I saw a comment on Google+ a few weeks ago in which someone demonstrated a straightforward computation of Fibonacci numbers which was not based on recursion and did not use memoization: he simply remembered the last 2 numbers and kept adding them. This is an O(N) algorithm, but he implemented it very cleanly. So I quickly pointed out that a quicker way is to take advantage of the fact that Fibonacci numbers can be computed as powers of the matrix [[0,1],[1,1]], so that only O(log N) matrix multiplications are needed.
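
For reference, a minimal sketch of that iterative approach:

def fib_iter(n):
    # remember only the last two values; n additions in total
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a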

The problem, of course, is that this is far from optimal past a certain point. It is efficient as long as the numbers are not too large, but their length grows at the rate of N*log(phi)/log(10), where N is the index of the Fibonacci number and phi is the golden ratio ((1+sqrt(5))/2 ~ 1.618). As it turns out, log(phi)/log(10) is very close to 1/5, so the Nth Fibonacci number can be expected to have roughly N/5 digits.
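
A quick sanity check of that estimate:

from math import log10, sqrt

phi = (1 + sqrt(5)) / 2
print(log10(phi))                    # ~0.20899, i.e. roughly 1/5
print(int(100000 * log10(phi)) + 1)  # 20899 digits expected for F(100,000)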

Matrix multiplication, heck even number multiplication, gets very slow when the numbers start to have millions or billions of digits. So F(100,000) took about 0.03 seconds to compute (in Python), while F(1,000,000) took roughly 5 seconds. This is hardly O(log N) growth. My estimate was that this method, without improvements, only optimizes the computation to be O((log N)^2.5) or so.

Computing a billionth Fibonacci number, at this rate, would be prohibitively slow (even though it would only have ~ 1,000,000,000 / 5 digits, so it easily fits within a 32-bit address space).
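
Back of the envelope on the memory claim:

from math import log2, sqrt

phi = (1 + sqrt(5)) / 2
bits = 10**9 * log2(phi)   # F(1,000,000,000) needs ~6.9e8 bits
print(bits / 8 / 2**20)    # ~83 MiB stored as a binary integer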

Does anyone know of an implementation or algorithm which would allow a faster computation? Perhaps something which would allow calculation of a trillionth Fibonacci number.

And just to be clear, I am not looking for an approximation. I am looking for the exact computation (to the last digit).

Edit 1: I am adding the Python (2.7) code to show what I believe to be an O((log N)^2.5) algorithm.

from operator import mul
from time import clock  # Python 2; note that time.clock() is gone in Python 3.8+

class TwoByTwoMatrix:
    __slots__ = "rows"

    def __init__(self, m):
        self.rows = m

    def __imul__(self, other):
        # plain row-by-column product; the entries are arbitrary-precision ints
        self.rows = [[sum(map(mul, my_row, oth_col)) for oth_col in zip(*other.rows)]
                     for my_row in self.rows]
        return self

    def intpow(self, i):
        i = int(i)
        result = TwoByTwoMatrix([[long(1), long(0)], [long(0), long(1)]])  # identity
        if i <= 0:
            return result
        # strip the trailing zero bits of the exponent; they are handled
        # by repeatedly squaring the final result at the end
        k = 0
        while i % 2 == 0:
            k += 1
            i >>= 1
        multiplier = TwoByTwoMatrix(self.rows)
        # square-and-multiply over the remaining bits
        while i > 0:
            if i & 1:
                result *= multiplier
            multiplier *= multiplier  # square it
            i >>= 1
        for j in xrange(k):
            result *= result
        return result


m = TwoByTwoMatrix([[0,1],[1,1]])

t1 = clock()
print len(str(m.intpow(100000).rows[1][1]))
t2 = clock()
print t2 - t1

t1 = clock()
print len(str(m.intpow(1000000).rows[1][1]))
t2 = clock()
print t2 - t1

Edit 2: It looks like I didn't account for the fact that len(str(...)) makes a significant contribution to the overall runtime of the test. Changing the tests to

from math import log

t1 = clock()
print log(m.intpow(100000).rows[1][1])/log(10)
t2 = clock()
print t2 - t1

t1 = clock()
print log(m.intpow(1000000).rows[1][1])/log(10)
t2 = clock()
print t2 - t1

shortened the runtimes to 0.008 seconds and 0.31 seconds (from 0.03 seconds and 5 seconds when len(str(...)) was used).

Because M=[[0,1],[1,1]] raised to the power N is [[F(N-2), F(N-1)], [F(N-1), F(N)]], the other obvious source of inefficiency was calculating the (0,1) and (1,0) elements of the matrix as if they were distinct. This led to the following (I also switched to Python 3, but the Python 2.7 times are similar):

class SymTwoByTwoMatrix:
    # elements (0,0), (0,1), (1,1) of a symmetric 2x2 matrix are a, b, c;
    # b is also the (1,0) element because the matrix is symmetric

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def __imul__(self, other):
        # this multiplication does work correctly because we 
        # are multiplying powers of the same symmetric matrix
        self.a, self.b, self.c = \
            self.a * other.a + self.b * other.b, \
            self.a * other.b + self.b * other.c, \
            self.b * other.b + self.c * other.c
        return self

    def intpow(self, i):
        i = int(i)
        result = SymTwoByTwoMatrix(1, 0, 1)
        if i <= 0:
            return result
        k = 0
        while i % 2 == 0:
            k += 1
            i >>= 1
        multiplier = SymTwoByTwoMatrix(self.a, self.b, self.c)
        while i > 0:
            if i & 1:
                result *= multiplier
            multiplier *= multiplier # square it
            i >>= 1
        for j in range(k):
            result *= result
        return result

This calculated F(100,000) in 0.006 seconds, F(1,000,000) in 0.235 seconds, and F(10,000,000) in 9.51 seconds.

This is to be expected: the symmetric version produces results 45% faster on the fastest test, and the gain should asymptotically approach phi/(1+2*phi+phi*phi) ~ 23.6%.
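
Just evaluating that expression, as a sanity check:

phi = (1 + 5 ** 0.5) / 2
print(phi / (1 + 2 * phi + phi * phi))  # ~0.236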

The (0,0) element of M^N is actually the (N-2)nd Fibonacci number:

m = SymTwoByTwoMatrix(0, 1, 1)  # the same M, in the symmetric representation
for i in range(15):
    x = m.intpow(i)
    print([x.a, x.b, x.c])

gives

[1, 0, 1]
[0, 1, 1]
[1, 1, 2]
[1, 2, 3]
[2, 3, 5]
[3, 5, 8]
[5, 8, 13]
[8, 13, 21]
[13, 21, 34]
[21, 34, 55]
[34, 55, 89]
[55, 89, 144]
[89, 144, 233]
[144, 233, 377]
[233, 377, 610]

I would expect that not having to calculate element (0,0) would produce an additional speed-up of 1/(1+phi+phi*phi) ~ 19%. But the lru_cache-of-F(2N)-and-F(2N-1) solution given by Eli Korvigo below actually gives a speed-up of 4 times (i.e., 75%). So, while I have not worked out a formal explanation, I am tempted to think that it caches the spans of 1's within the binary expansion of N and does the minimum number of multiplications necessary, which obviates the need to find those ranges, precompute them, and then multiply them at the right point in the expansion of N. lru_cache allows for a top-to-bottom computation of what would otherwise be a more complicated bottom-to-top computation.
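
One quick way to probe that hunch (an experiment, not a proof) is to instrument the doubling recursion from Eli's answer below so it records which arguments actually get computed; the number of distinct arguments stays small, on the order of 2*log2(N):

from functools import lru_cache

distinct_args = set()

@lru_cache(None)
def fib(n):
    # same doubling recursion as in Eli Korvigo's answer below,
    # with the fib(0) = fib(1) = 1 indexing
    distinct_args.add(n)
    if n in (0, 1):
        return 1
    if n & 1:
        m = (n + 1) // 2
        return fib(m - 1) * (2 * fib(m) - fib(m - 1))
    a, b = fib(n // 2 - 1), fib(n // 2)
    return a * a + b * b

fib(10**6)
print(len(distinct_args))  # only a few dozen distinct arguments for N = 10**6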

Both SymTwoByTwoMatrix and the lru_cache-of-F(2N)-and-F(2N-1) approach take roughly 40 times longer to compute every time N grows 10 times. I think that's possibly due to Python's implementation of the multiplication of long ints. I think the multiplication of large numbers and their addition should be parallelizable, so a multi-threaded sub-O(N) solution should be possible, even though (as Daniel Fischer states in the comments) computing F(N) is Theta(N).
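
As an aside (an assumption on my part that it is installed; none of the timings above use it): the gmpy2 bindings to GMP swap in subquadratic big-integer multiplication, and GMP even ships its own Fibonacci routine:

import gmpy2  # assumption: gmpy2 is available (pip install gmpy2)

# Seeding the matrix with gmpy2.mpz values makes all subsequent
# arithmetic use GMP instead of CPython's long ints:
m = SymTwoByTwoMatrix(gmpy2.mpz(0), gmpy2.mpz(1), gmpy2.mpz(1))

# Or call GMP's built-in routine directly (note it uses the standard
# F(0)=0, F(1)=1 indexing, offset by one from the matrix above):
print(len(str(gmpy2.fib(10**6))))  # 208988 digits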

Dmitry Rubanovich

  • Since `F(n)` has `Theta(n)` bits (digits in whatever base), you _cannot_ compute it faster than `O(n)`. – Daniel Fischer Jul 18 '16 at 20:32
  • @Daniel Fischer, that's an interesting point. So for lower numbers you take advantage of the fact that multiplication is essentially a parallel operation of additions. So effectively, you need to parallelize multiplication of large numbers? – Dmitry Rubanovich Jul 18 '16 at 20:37
  • Oi, *you* bounced it, this is still a CS question, not an SO question. It is off-topic here. – Martijn Pieters Jul 18 '16 at 20:37
  • I'm closing this question as off-topic because it is not about a practical programming problem as outlined in the [help/on-topic]. – Martijn Pieters Jul 18 '16 at 20:37
  • @Martijn Pieters. I actually erased it in CS. I am more interested in code than in an algorithm outline. I was planning to post my solution in an edit, but if you close it, I am not sure if I'll be able to. – Dmitry Rubanovich Jul 18 '16 at 20:39
  • You are asking for algorithms, not code. – Martijn Pieters Jul 18 '16 at 20:42
  • If you're looking for “streamlining of operations”, this is the right place. But if you're looking for algorithms with better asymptotic performance, then what you're asking is a CS question. You'd be more likely to get good answers on [cs.se] (and less likely to attract the crowd that doesn't understand that floating point calculations are not exact) (but don't post Python code on [cs.se]). – Gilles 'SO- stop being evil' Jul 18 '16 at 20:50
  • @Gilles, actually I wasn't looking for streamlining of operations until Daniel Fischer's very perceptive comment. I think now I am looking for both. Although his comment mostly put an end to my thinking that better big-O is possible. Maybe it would better belong in Programming puzzles. I think the streamlining would require pre-sorting and pre-computing of consecutive ranges of 1's in binary representation of N. – Dmitry Rubanovich Jul 18 '16 at 20:55
  • You may also be interested in the algorithm in [this answer](http://stackoverflow.com/a/6037936). – Daniel Fischer Jul 18 '16 at 21:14
  • My primitive Haskell implementation computed a billionth number in less than a minute (also based on matrix multiplication), and the 10,000,000,000th number in about 12 minutes. There's a faster implementation [here](https://wiki.haskell.org/The_Fibonacci_sequence) (3 minutes for the 10,000,000,000th number). The trillionth number would have roughly 200,000,000,000 digits. If you have a machine with 200 GB of RAM lying around, I'm pretty sure this algo would compute it in less than a week (increasing the argument by a factor of 10 increases computation time roughly 12-fold). – n. m. could be an AI Jul 18 '16 at 21:22

4 Answers


Since the Fibonacci sequence is a linear recurrence, its members can be evaluated in closed form. This involves computing a power, which can be done in O(log n) similarly to the matrix-multiplication solution, but the constant overhead should be lower. That's the fastest algorithm I know.

    F(n) = (phi^n - psi^n) / sqrt(5),  where phi = (1+sqrt(5))/2 and psi = (1-sqrt(5))/2

EDIT

Sorry, I missed the "exact" part. Another exact O(log(n)) alternative to the matrix multiplication can be calculated as follows

    F(2n-1) = F(n)^2 + F(n-1)^2
    F(2n)   = F(n) * (2*F(n-1) + F(n))

from functools import lru_cache

@lru_cache(None)
def fib(n):
    if n in (0, 1):
        return 1
    if n & 1:  # if n is odd, it's faster than checking with modulo
        return fib((n+1)//2 - 1) * (2*fib((n+1)//2) - fib((n+1)//2 - 1))
    a, b = fib(n//2 - 1), fib(n//2)
    return a**2 + b**2

This is based on the derivation from a note by Prof. Edsger Dijkstra. The solution exploits the fact that to calculate both F(2N) and F(2N-1) you only need to know F(N) and F(N-1). Nevertheless, you are still dealing with long-number arithmetic, though the overhead should be smaller than that of the matrix-based solution. In Python it would be better to rewrite this in an imperative style due to slow memoization and recursion, though I wrote it this way for the clarity of the functional formulation.
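
For instance, a minimal sketch of such an imperative rewrite (using the standard F(0) = 0, F(1) = 1 indexing, offset by one from the recursion above):

def fib_pair(n):
    # iterative fast doubling: returns (F(n), F(n+1)); walks the bits
    # of n from most significant to least significant
    a, b = 0, 1  # (F(0), F(1))
    for bit in bin(n)[2:]:
        a, b = a * (2 * b - a), a * a + b * b  # (F(2k), F(2k+1))
        if bit == '1':
            a, b = b, a + b                    # advance to (F(2k+1), F(2k+2))
    return a, b

assert fib_pair(10)[0] == 55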

Eli Korvigo

  • maybe you could throw sympy at this form to get an analytic answer without rounding error of irrational numbers – Aaron Jul 18 '16 at 20:21
  • The key is precise computation. – Dmitry Rubanovich Jul 18 '16 at 20:23
  • You could probably use an arbitrary-precision arithmetic library and calculate how much precision you need to throw at the problem to get an exact result. Approximating the magnitude of the output in advance is easy, and for this problem, you only need a bit more precision than what's necessary to represent the magnitude to guarantee that the closest integer to the calculated result is the exact result. – user2357112 Jul 18 '16 at 20:58
  • @user2357112, If you re-read the question, I am precalculating the number of significant digits. The estimate is that the Nth Fibonacci number takes ~ N/5 digits. – Dmitry Rubanovich Jul 18 '16 at 21:01
  • @DmitryRubanovich Sorry, I missed the "exact" part. There is another exact logarithmic algorithm. Take a look on the update. – Eli Korvigo Jul 18 '16 at 22:07
  • Both formulas come out directly from squaring the matrix. That is, if M=[[0,1],[1,1]], then F(N) = (M^N)[1][1] and F(N-1) = (M^N)[0][1]. So F(2N) = (M^(2N))[1][1] = ((M^N)^2)[1][1]. Similarly for F(2N-1). The problem is that (as the question outlines) this operation requires *multiplying* very large numbers, and that's an expensive operation when the numbers have millions of digits. – Dmitry Rubanovich Jul 18 '16 at 22:33
  • Although using lru_cache to avoid finding subsequences of consecutive 1 bits in the power may be a *very* nice simplification. But the formula itself essentially computes only the (0,0), (0,1), and (1,1) elements of the matrix and ignores the (1,0) element because it will always be equal to (0,1). For large N, most of the time will be spent in going from N/2 to N because that's where the largest numbers will be multiplied. – Dmitry Rubanovich Jul 18 '16 at 22:49
  • @DmitryRubanovich yes, I wrote in the edit, that the solution doesn't remove the long-number arithmetics, after all you won't be able to get the exact answer without it. – Eli Korvigo Jul 18 '16 at 22:51
  • you will if you just return long(1) instead of 1 in the beginning. and don't use ** operator to square numbers. use x*x instead. – Dmitry Rubanovich Jul 18 '16 at 22:54
  • @DmitryRubanovich this is Python 3, there are no `long` and `int` anymore, there is only `int` (== `long` from Python2). Anyway, how does it help you avoid numbers exceeding 64 bits? I agree with the `x*x` part. – Eli Korvigo Jul 18 '16 at 22:57
  • then you should be ok with just removing the ** operator. it uses C's pow(), which returns a floating point number. x*x will keep it an integer. – Dmitry Rubanovich Jul 18 '16 at 22:59
  • @DmitryRubanovich sure, I've updated the answer. Thanks for the catch. – Eli Korvigo Jul 18 '16 at 23:02
  • Is it easy for you to break the simplification steps down? My maths are not that great I'm afraid. – Alexandros Spyropoulos Feb 20 '17 at 02:08
  • Also I do not understand why you have two different implementations, one for odd and one for even numbers. – Alexandros Spyropoulos Feb 20 '17 at 02:19
  • @AlexandrosSpyropoulos do you mean you don't understand Dijkstra's derivation I linked? – Eli Korvigo Feb 20 '17 at 15:44
  • @DmitryRubanovich Now that I've returned to this thread, I want to stress, that Python's built-in `**` operator does not use C's `pow`. In fact, it uses whatever is overloaded in an object's class. And for `int` instances it doesn't return a `float` when the exponent is `int`. I guess, you were thinking about `math.pow` (because it casts its argument to `double` straight away), and I was too lazy to give it a second thought. – Eli Korvigo Jan 04 '18 at 20:17
  • @Eli Korvigo, I did notice the same, but it was some time after this thread was active. I am not sure if this changed between different versions of Python. Somehow (somewhere in the very back of my mind) I do recall getting floating point results (when using **) when numbers got outside the range of what's supported by the range of values of ints. While sticking with plain arithmetic operators, I got results which were arbitrarily large integers. I am not sure if it is interesting enough to check. – Dmitry Rubanovich Jan 05 '18 at 09:24
  • @Eli Korvigo, I do recall that I was trying results both in 2.7.x and 3.x. And that I only switched to 3.x to take advantage of lru_cache (which I learned from you). – Dmitry Rubanovich Jan 05 '18 at 09:24

This is too long for a comment, so I'll leave an answer.

The answer by Aaron is correct, and I've upvoted it, as should you. I will provide the same answer, and explain why it is not only correct, but the best answer posted so far. The formula we're discussing is:

    F(n) = round(phi^n / sqrt(5)) = floor(phi^n / sqrt(5) + 1/2)

Computing Φ is O(M(n)), where M(n) is the complexity of multiplication (currently a little over linearithmic) and n is the number of bits.

Then there's a power function, which can be expressed as a log (O(M(n)•log(n))), a multiply (O(M(n))), and an exp (O(M(n)•log(n))).

Then there's a square root (O(M(n))), a division (O(M(n))), and a final round (O(n)).

This makes this answer something like O(n•log^2(n)•log(log(n))) for n bits.


I haven't thoroughly analyzed the division algorithm, but if I'm reading this right, each bit might need a recursion (you need to divide the number log(2^n)=n times) and each recursion needs a multiply. Therefore it can't be better than O(M(n)•n), and that's exponentially worse.

geometrian

  • Well, if you want to edit his answer to make it clearer how to extend it beyond double floating-point precision, you could probably make it more clear. But I am not seeing it. – Dmitry Rubanovich Jul 21 '16 at 08:46

Using the "weird square rooty equation" from the other answer (Binet's closed form), you CAN compute the kth Fibonacci number exactly. This is because the sqrt(5) falls out in the end. You just have to arrange your multiplication to keep track of it in the meantime.

def rootiply(a1, b1, a2, b2, c):
    ''' multiply a1 + b1*sqrt(c) by a2 + b2*sqrt(c)... returns the new a, b '''
    return a1*a2 + b1*b2*c, a1*b2 + a2*b1

def rootipower(a, b, c, n):
    ''' raise a + b*sqrt(c) to the nth power... returns the a, b of the result in the same format '''
    ar, br = 1, 0
    while n != 0:
        if n % 2:
            ar, br = rootiply(ar, br, a, b, c)
        a, b = rootiply(a, b, a, b, c)
        n //= 2  # integer division; plain / would produce a float on Python 3
    return ar, br

def fib(k):
    ''' the kth fibonacci number '''
    a1, b1 = rootipower(1, 1, 5, k)    # (1 + sqrt(5))**k
    a2, b2 = rootipower(1, -1, 5, k)   # (1 - sqrt(5))**k
    a = a1 - a2
    b = b1 - b2
    a, b = rootiply(0, 1, a, b, 5)     # multiply by sqrt(5)
    # b should be 0!
    assert b == 0
    return a // 2**k // 5              # both divisions are exact

if __name__ == "__main__":
    assert rootipower(1, 2, 3, 3) == (37, 30)  # (1+2*sqrt(3))**2 = 13 + 4*sqrt(3); cubed = 37 + 30*sqrt(3)
    assert fib(10) == 55
Him
  • I haven't quite read through it yet, but to help me out, are you implementing Q[sqrt(5)] arithmetic? – Dmitry Rubanovich Jan 20 '17 at 20:41
  • Yeah, arithmetic in the space of a + b*sqrt(5), closed under addition and multiplication. It's a ring, I guess? Is Q[sqrt(5)] the notation for that space? – Him Jan 20 '17 at 20:47
  • I think Q[sqrt(5)] is a field, but it's constructed as a ring extension (and happens to be a field in this case). R[X] is the notation for the extension of a ring R by X, where X is not in R. – Dmitry Rubanovich Jan 20 '17 at 21:00
  • In that case, yes, this implements Q[sqrt(n)] arithmetic (where we specifically use n=5). The power function is O(log(k)) where k is the power you are raising the number to, which in this case is the index of the Fibonacci number you want to compute. Not sure if this actually offers any speedup over the other methods. – Him Jan 20 '17 at 21:15
  • I don't see what you are trying to do (compute in (ℤ, ℤ[φ])? To what advantage?). Please disambiguate your reference to another answer by providing a link to it (square brackets around a caption of your fancy, followed by the address (copy from "the `share` link" below the post) in parentheses). – greybeard Jan 21 '17 at 07:37
  • For one, I suppose, this is O(log(n)) with constant space requirements. The accepted answer has log(n) stuff on the stack by the end. Also, I would rather have a generic answer... this code is handy for computing the fibonacci numbers, but the same technique can be applied to many linear recurrence relations (ones with less than four terms should have closed forms with roots at the very least) – Him Jan 21 '17 at 14:55
  • @Scott, I still haven't read through it, but I think I should mention a few points. First, no solution will be O(log(n)) for n larger than 5*(the size of any number whose computation is atomic). This is because Fib(n) has approximately n/5 digits (because log(phi)/log(10) ~ .2). So the multiplication involved in computing the last number will be the longest operation of the entire computation. Second, Eli Korvigo's solution takes advantage of the formula which comes out of the linear recurrence matrix, and it also caches intermediate results. – Dmitry Rubanovich Jan 21 '17 at 23:11
  • There are no intermediate results that need caching in this solution. If you'll look through this solution there are no recursive calls. – Him Jan 21 '17 at 23:16
  • @Scott, (cont.) Eli's solution is, so far, the cleanest and the fastest, so it attracted the most upvotes, but I haven't accepted it yet (even though I am tempted). I'll work through your solution to see if it's similar and if it gets around the multiplication of 2 large numbers as the last step. I was hoping there would be a solution which would allow parallelization of the squaring of a large number. That would allow solutions faster than the ones we are seeing now, which are O(n^(1+x)), where x is somewhere between .1 and .5. – Dmitry Rubanovich Jan 21 '17 at 23:17
  • @Scott, ok, I read through your code and refactored the operations into an actual class which does Q[sqrt(5)] arithmetic. Using the class, I am consistently clocking 1.68 seconds for the 1,000,000th fib number. Using the functions as you wrote them, I get 2.959 seconds for the same fib number. On the same system, using the SymMatrix method I outlined above, I get 0.12 seconds for the 1,000,000th fib number. So while your point about the cancellations is a good observation, it doesn't justify using this method of calculation beyond the numbers on which arithmetic can be done atomically. – Dmitry Rubanovich Jan 25 '17 at 06:01
  • @Scott, I am convinced that most of the calculations are actually spent in computing numbers which go away after the cancellations. There are also a lot of redundant calculations which can be eliminated by using a recurrence matrix to compute binomial coefficients and then not doing any redundant computations of powers of 5 (of 25 really, because every other one will get cancelled). But your program does expose the insight which leads to a different approach than using a recurrence matrix on the fib numbers themselves. So it is interesting from that standpoint. – Dmitry Rubanovich Jan 25 '17 at 06:06

From Wikipedia,

For all n ≥ 0, the number F(n) is the closest integer to phi^n / sqrt(5), where phi is the golden ratio. Therefore, it can be found by rounding, that is, by the use of the nearest integer function.
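
A small sketch to make the precision limit concrete (assuming IEEE doubles; the exact cutoff may vary): rounding phi^n / sqrt(5) in ordinary floating point matches the true sequence only up to about n = 70:

from math import sqrt

phi = (1 + sqrt(5)) / 2

def fib_round(n):
    # Binet's formula evaluated in 64-bit floating point
    return round(phi ** n / sqrt(5))

def fib_exact(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

n = 0
while fib_round(n) == fib_exact(n):
    n += 1
print("first mismatch at n =", n)  # around 71 with IEEE doubles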

Aaron

  • But phi and sqrt(5) are irrational. So their values would have loss of information at the tail end. I am looking for an exact computation. – Dmitry Rubanovich Jul 18 '16 at 20:16
  • You may have to accept then that as your required precision increases with the number of digits, then so does your execution time... – Aaron Jul 18 '16 at 20:18
  • this would be relatively easy to do in x86 i'm sure with some help from pre-built big numbers code – Aaron Jul 18 '16 at 20:19
  • This would require floating point arithmetic which is accurate to billions of digits or even trillions of digits. I don't think that's doable. We are not talking about numbers whose values are in billions or trillions, but whose precision is accurate to billionth or trillionth place. – Dmitry Rubanovich Jul 18 '16 at 20:23
  • no, I mean count up from 1,1,2,3,5 using x86... eliminate language overhead of a simple accumulate and store operation – Aaron Jul 18 '16 at 20:24
  • Well, going from F(1,000,000) to F(2,000,000) with straight additions requires a million additions of numbers that are 200,000-400,000 digits long. Doing it with simple multiplication of matrices, this is just 1 squaring. So it means 4 additions and 8 multiplications of numbers of that magnitude. Whether those multiplications dominate the computation and whether that can be avoided (and possibly parallelized) is something I am not sure about. – Dmitry Rubanovich Jul 18 '16 at 20:29