
I know this subject is well discussed, but I've come across a case where I don't really understand how the recursive method is "slower" than a method using reduce, lambda, and xrange.

def factorial2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial2(x-1, rest*x)


def factorial3(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))
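(Both are Python 2, where reduce and xrange are builtins; a quick check that the two agree:)

assert factorial2(5) == factorial3(5) == 120  # 5! = 120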

I know Python doesn't optimize tail recursion, so the question isn't about that. To my understanding, a generator should still generate n numbers using the +1 operator. So technically, fact(n) should add a number n times, just like the recursive one. The lambda in the reduce will be called n times, just like the recursive method... So since we don't have tail call optimization in either case, stacks will be created/destroyed and returned n times. And an if in the generator should check when to raise a StopIteration exception.
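To make that concrete, the mental model I have of xrange is roughly this pure-Python generator (just a sketch; the real xrange is implemented in C):

def my_xrange(start, stop):
    # naive stand-in for xrange: counts up with +1 until stop,
    # at which point the iteration machinery raises StopIteration
    i = start
    while i < stop:
        yield i
        i += 1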

This makes me wonder why the recursive method is still slower than the other one, since the recursive one uses simple arithmetic and doesn't use generators.

In a test, I replaced rest*x with x in the recursive method, and the time spent dropped to be on par with the method using reduce.

Here are my timings for fact(400), run 1000 times:

factorial3 : 1.22370505333

factorial2 : 1.79896998405
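(The exact harness isn't shown here; something along these lines with timeit reproduces the comparison:)

import timeit

print timeit.timeit('factorial3(400)', setup='from __main__ import factorial3', number=1000)
print timeit.timeit('factorial2(400)', setup='from __main__ import factorial2', number=1000)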

Edit:

Making the method count from 1 up to n instead of from n down to 1 doesn't help either, so the overhead isn't in the -1.

Also, can we make the recursive method faster? I tried multiple things: global variables that I can change... using a mutable context by placing the variables in cells that I can modify, like an array, so the recursive method takes no parameters (see the sketch below)... passing the function used for recursion as a parameter so we don't have to "dereference" it in our scope... But nothing makes it faster.
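For instance, the mutable-cell attempt looked roughly like this (names are illustrative):

def factorial_cells(n):
    # state lives in a mutable cell instead of being passed as parameters
    state = [n, 1]  # [x, rest]
    def rec():
        if state[0] <= 1:
            return state[1]
        state[1] *= state[0]
        state[0] -= 1
        return rec()
    return rec()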

I'll point out that I have a version of fact that uses a for loop (shown below) that is much faster than both of these methods, so there is clearly room for improvement, but I wouldn't expect anything faster than the for loop.
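That for-loop version is the straightforward accumulator, roughly:

def factorial_loop(x):
    # plain loop with a running product; no per-step Python function call
    r = 1
    for i in xrange(2, x + 1):
        r *= i
    return r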

Loïc Faure-Lacroix
  • I have the feeling that it's `range` being optimized. Replacing `range`/`xrange` with my own generator makes things much worse. Chances are that `range` is implemented in C and not in Python, which would explain why the recursive method cannot beat any method using `range`. Also, results could be much different using a different interpreter, like PyPy. – Loïc Faure-Lacroix Feb 09 '17 at 08:01
  • How much worse does your own generator make it? I tested it as well and while it was worse than Python's `xrange`, it was still much better than the recursive solution. – Stefan Pochmann Feb 09 '17 at 08:09
  • While the timing difference is probably significant, the two runtimes are still within the same order of magnitude. So `range`/`xrange` doing the adding/iterating in C vs. `factorial2` or your own generator doing it in Python might well explain the difference. Additionally, without tail recursion, `factorial2` has to build up the recursive call stack (which means iterative memory allocation), and each iteration has a branch, which might hurt optimizations such as pipelining. – das-g Feb 09 '17 at 08:10
  • @das-g You think that calling a lambda multiple times in an iterative way would make Python reuse the stack it just destroyed? I guess that could be less heavy for the VM than creating 100 stacks and then destroying 100 stacks. Because if you reuse memory you don't have to allocate that much. – Loïc Faure-Lacroix Feb 09 '17 at 08:14
  • @StefanPochmann "1.2329249382" vs "1.86193585396" With the python generator and the recursive method at "1.87423300743" – Loïc Faure-Lacroix Feb 09 '17 at 08:18
  • Your recursion variant is unnecessarily slowed down by the second parameter: `return x * factorial2(x-1)` is 10% faster. Your reduce variant is slowed down by the lambda: `operator.mul` saves 20% (see the sketch after these comments). – guidot Feb 09 '17 at 08:21
  • I realized that it got worse when I used a `while` loop vs. a `for` + `range` loop. Thinking about it, I'd say it's `range` written in C + the overhead of allocating memory vs. reallocating memory. – Loïc Faure-Lacroix Feb 09 '17 at 08:21
  • @guidot On my computer `operator.mul` makes it 50% faster. Practically on par with the for loop. – Loïc Faure-Lacroix Feb 09 '17 at 08:26
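For reference, guidot's `operator.mul` suggestion is a drop-in replacement for the lambda (a sketch; `factorial3_mul` is an illustrative name):

import operator

def factorial3_mul(x):
    if x <= 1:
        return 1
    # operator.mul is implemented in C, unlike the Python-level lambda
    return reduce(operator.mul, xrange(1, x + 1))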

1 Answer


The slowness of the recursive version comes from the need to resolve the named (argument) variables on each call. I have provided a different recursive implementation that has only one argument, and it works slightly faster.

$ cat fact.py 
def factorial_recursive1(x):
    if x <= 1:
        return 1
    else:
        return factorial_recursive1(x-1)*x

def factorial_recursive2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial_recursive2(x-1, rest*x)

def factorial_reduce(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

# Ignore the rest of the code for now, we'll get back to it later in the answer
def range_prod(a, b):
    # product of the integers in the half-open range [a, b)
    if a + 1 < b:
        # split the range in half and multiply the two partial products,
        # so that factors of similar size get multiplied together
        c = (a+b)//2
        return range_prod(a, c) * range_prod(c, b)
    else:
        # base case: a single-element range [a, a+1)
        return a
def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n+1)

$ ipython -i fact.py 
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop

Since in your example very large numbers are involved, initially I suspected that the performance difference might be due to the order of multiplication. Multiplying a large partial product by the next number on every iteration takes time proportional to the number of digits/bits in the product, so the time complexity of such a method is O(n²), where n is the number of bits in the final product. Instead, it is better to use a divide and conquer technique, where the final result is obtained as a product of two approximately equally long values, each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the above code). As you can see below, it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters) but outperforms it for large arguments.

In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
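To see why the order matters, watch how wide the running product gets in the naive left-to-right order; each of the later multiplications does work proportional to that width (a quick illustration, not part of the timed code above):

p = 1
for i in xrange(1, 401):
    p *= i
print p.bit_length()  # ~2887: 400! is about 2887 bits wide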

UPDATE

Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:

In [7]: import sys; sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop
Leon
  • Nice, I didn't think about doing that, as I'm used to making recursive functions tail-call optimizable. I realize that in Python it's pointless to do so. Making the recursive function a tree process not only works faster but also makes the stack much less deep, avoiding the maximum recursion depth. You could still reach it with very large numbers, but that's a clear improvement on my version. Also, as pointed out in the comments, I'd guess that you need to allocate less memory as the stack is less deep, resulting in a speed improvement. – Loïc Faure-Lacroix Feb 09 '17 at 09:46
  • Can you add timings for `factorial_recursive1` and `factorial_recursive2` with x=4000 as well? – Stefan Pochmann Feb 09 '17 at 10:05
  • @StefanPochmann I can't - they exceed the recursion depth limit – Leon Feb 09 '17 at 10:07
  • @Leon Can't you increase it? `import sys; sys.setrecursionlimit(4100)` – Stefan Pochmann Feb 09 '17 at 10:14
  • @StefanPochmann Well, I increased the recursion limit and they ran successfully. See updated answer. – Leon Feb 09 '17 at 10:17
  • @Leon Thanks. I think it nicely shows how `factorial_recursive2` suffers from using the bad multiplication order, as it gets even worse in comparison. Btw, I wonder whether it would be worth it to optimize your divide-and-conquer further, splitting the range into two ranges whose multiplication results are equally long... – Stefan Pochmann Feb 09 '17 at 10:23
  • @guidot you meant O(log(n)²). I considered that while writing the answer, but I think that my version is easier to comprehend. – Leon Feb 09 '17 at 10:26
  • @Leon: No, log(n²) = 2*log(n), and the 2 vanishes as a constant under O() notation. – guidot Feb 09 '17 at 10:31
  • @guidot No, you confused yourself with your redefinition of n. It's O(log(n)²) with your definition of n (which you'd btw better provide). – Stefan Pochmann Feb 09 '17 at 10:40
  • @guidot Consider m=k! and n=log(m)=log(k!). Then, toward the end of the loop, each multiplication will take O(n) time. Thus, the full complexity is O(n*k), which, due to [Stirling's approximation](https://en.wikipedia.org/wiki/Stirling's_approximation), can be approximated as O(n²) or O(k²) (if we drop the slow O(log(k)) coefficients). – Leon Feb 09 '17 at 11:34
  • @guidot That O(n²) isn't for multiplying two big numbers but for the factorial, i.e., for multiplying all the small numbers. But it doesn't even matter. Even if it were about multiplying two big numbers n⋅n, it wouldn't be O(log(n)) but O(log(n)²). With the naive method we learned in school, that is. Karatsuba's algorithm for example is better, drops the exponent to about 1.585 instead of 2. – Stefan Pochmann Feb 09 '17 at 12:06
  • I'll accept the answer, since there is no other answer and this one gives a pretty good observation on the matter. – Loïc Faure-Lacroix Mar 07 '17 at 12:24