Prelude
I have two implementations for a particular problem, one recursive and one iterative, and I want to know what causes the iterative solution to be ~30% slower than the recursive one.
Given the recursive solution, I write an iterative solution making the stack explicit.
Clearly, I am simply mimicking what the recursion does, so it is no surprise that the Python engine's built-in call bookkeeping is better optimized than my explicit stack. But can we write an iterative method with comparable performance?
My case study is Problem #14 on Project Euler.
Find the longest Collatz chain with a starting number below one million.
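For concreteness, the plain Collatz rule from the problem statement, with chain length counting every number visited, can be sketched as (a minimal version, no memoization and no shortcuts):

```python
def collatz_len(n):
    # Length of the Collatz chain starting at n, counting n itself
    # and the final 1: n -> n/2 if even, 3n + 1 if odd.
    length = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length
```

For example, collatz_len(13) is 10, for the chain 13, 40, 20, 10, 5, 16, 8, 4, 2, 1.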
Code
Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):
def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        if n not in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)
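To see why the bit tests are valid: for odd n, 3n + 1 is always even, and when n ≡ 1 (mod 4) (the else branch), 3n + 1 is additionally divisible by 4, so one or two halvings can be folded into a single step. A quick equivalence check against the naive rule (my own sketch, not part of the timed code):

```python
def naive_len(n):
    # One Collatz step at a time: n -> n/2 (even) or 3n + 1 (odd).
    length = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

def shortcut_len(n):
    # Same chain length using the bit-based shortcuts from collatz_r.
    length = 1
    while n != 1:
        if n % 2 == 0:
            n, length = n // 2, length + 1
        elif n % 4 == 3:  # n odd, so 3n + 1 is even: fold one halving
            n, length = (3 * n + 1) // 2, length + 2
        else:  # n % 4 == 1: 3n + 1 = 4(3(n // 4) + 1): fold two halvings
            n, length = (3 * n + 1) // 4, length + 3
    return length
```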
Here's my iterative version:
def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        stack = []
        while n not in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)
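As a sanity check that the two versions compute the same answer, here is a self-contained cross-check on a smaller bound (my own harness; solve_small is a hypothetical name, and range is used instead of xrange so it also runs on Python 3):

```python
def solve_small(ub, use_iterative):
    # Restatement of both solvers above, sharing one fresh memo table
    # per call, so their answers can be compared on a small bound.
    table = {1: 1}

    def collatz_r(n):
        if n not in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]

    def collatz_i(n):
        stack = []
        while n not in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum

    key = collatz_i if use_iterative else collatz_r
    return max(range(ub // 2 + 1, ub, 2), key=key)
```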
And the timings on my machine (an i7 with plenty of memory) using IPython:
In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop
Comments
The recursive solution is awesome:
- Optimized to skip a step or two depending on the two least significant bits. My original solution didn't skip any Collatz steps and took ~1.86 s.
- It is difficult to hit Python's default recursion limit of 1000: collatz_r(9780657630) returns 1133 but requires fewer than 1000 recursive calls.
- Memoization avoids retracing.
- collatz_r lengths are calculated on-demand for max.
Playing around with it, timings seem to be precise to +/- 5 ms.
Languages with static typing like C and Haskell can get timings below 100 ms.
I put the initialization of the memoization table in the method by design for this question, so that timings would reflect the "re-discovery" of the table values on each invocation.

collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded, whereas collatz_i(2**1002) happily returns 1003.
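The depth issue itself can be illustrated without the bit shortcuts (my own sketch; sys.setrecursionlimit is a blunt workaround for the recursive version, not a fix, since each frame still consumes stack space):

```python
import sys

table = {1: 1}

def collatz_len_rec(n):
    # Plain memoized recursion: one Python-level frame per Collatz step.
    if n not in table:
        table[n] = collatz_len_rec(n // 2 if n % 2 == 0 else 3 * n + 1) + 1
    return table[n]

# 2**1002 needs about 1003 nested calls, which exceeds the default
# limit of 1000; raising the limit lets the recursion complete.
sys.setrecursionlimit(2500)
print(collatz_len_rec(2**1002))  # 1003: 1002 halvings plus the final 1
```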
I am familiar with generators, coroutines, and decorators.
I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).
What I am looking for
- an iterative solution that closes the performance gap
- discussion on how Python handles recursion
- the finer details of the performance penalties associated with an explicit stack
I'm looking mostly for the first, though the second and third are very important to this problem and would increase my understanding of Python.
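On the third point, one way to isolate the explicit-stack bookkeeping from the Collatz logic is to time a trivial recursion against the equivalent list-as-stack loop (my own sketch; count_rec and count_iter are hypothetical names, and results will vary by machine):

```python
from timeit import timeit

def count_rec(n):
    # One Python frame per step: the interpreter does the bookkeeping in C.
    return 0 if n == 0 else count_rec(n - 1) + 1

def count_iter(n):
    # Same work with an explicit list: each step pays for an append,
    # and the unwind pays again for each pop.
    stack = []
    while n:
        stack.append(1)
        n -= 1
    total = 0
    while stack:
        total += stack.pop()
    return total

t_rec = timeit(lambda: count_rec(900), number=1000)
t_iter = timeit(lambda: count_iter(900), number=1000)
```

The direction of this micro-comparison need not match the Collatz case, since collatz_i additionally builds an (n, y) tuple per step and makes a second pass over reversed(stack).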