
I was under the impression that using a sum construction was much faster than running a for loop. However, in the following code, the for loop actually runs faster:

import time

Score = [[3,4,5,6,7,8] for i in range(40)]

a=[0,1,2,3,4,5,4,5,2,1,3,0,5,1,0,3,4,2,2,4,4,5,1,2,5,4,3,2,0,1,1,0,2,0,0,0,1,3,2,1]

def ver1():
    for i in range(100000):
        total = 0
        for j in range(40):
            total+=Score[j][a[j]]
    print (total)

def ver2():
    for i in range(100000):
        total = sum(Score[j][a[j]] for j in range(40))
    print (total)


t0 = time.time()
ver1()
t1 = time.time()
ver2()
t2 = time.time()

print("Version 1 time: ", t1-t0)
print("Version 2 time: ", t2-t1)

The output is:

208
208
Version 1 time:  0.9300529956817627
Version 2 time:  1.066061019897461

Am I doing something wrong? Is there a way to do this faster?

(Note that this is just a demo I set up; in my real application the scores will not be repeated in this manner.)

Some additional info: This is run on Python 3.4.4 64-bit, on Windows 7 64-bit, on an i7.
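As an aside, the stdlib `timeit` module is usually a more reliable way to time small snippets than `time.time()`, since it picks the best available clock and handles the repetition for you. A sketch of the same comparison (same `Score` and `a` as above):

```python
import timeit

Score = [[3, 4, 5, 6, 7, 8] for i in range(40)]
a = [0, 1, 2, 3, 4, 5, 4, 5, 2, 1, 3, 0, 5, 1, 0, 3, 4, 2, 2, 4,
     4, 5, 1, 2, 5, 4, 3, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 3, 2, 1]

def loop_version():
    # explicit for loop, as in ver1
    total = 0
    for j in range(40):
        total += Score[j][a[j]]
    return total

def sum_version():
    # sum over a generator expression, as in ver2
    return sum(Score[j][a[j]] for j in range(40))

# timeit calls each function `number` times and returns total seconds
t_loop = timeit.timeit(loop_version, number=100000)
t_sum = timeit.timeit(sum_version, number=100000)
print(t_loop, t_sum)
```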

CaptainCodeman
  • [this question](http://stackoverflow.com/questions/24578896/python-built-in-sum-function-vs-for-loop-performance) says that `sum` should be much faster than a `for` loop. – Barmar Feb 04 '16 at 03:28
  • I think your bottleneck is the list comprehension, not the addition. – Barmar Feb 04 '16 at 03:30
  • @Barmar There is no list comprehension in either of the functions. Why should a *generator comprehension* be a bottleneck, since it's doing pretty much exactly the same as a for loop? I think it might just be the overhead of the function calls to `sum`, because the range is so small... – L3viathan Feb 04 '16 at 03:33
  • I guess you should look at the generated code to see what's different. – Barmar Feb 04 '16 at 03:36
  • @L3viathan: Also, generator expressions are closures; the lookup of `Score` and `a` on each loop is slightly more expensive because it has to check enclosing scope, then global, rather than just going straight to global. Stupid caching hacks can often help there, e.g. changing the `sum` genexpr to `total = sum(Score[j][a[j]] for Score, a in ((Score, a),) for j in range(40))` would make the two collections cache to local lookup once and avoid 160ish (maybe only 80ish, not sure) dict lookups for variable names. – ShadowRanger Feb 04 '16 at 03:49
  • @L3viathan, "since [a generator is] doing pretty much exactly the same as a for loop". A generator comprehension is more like a generator function than a loop. It's relatively much slower than a for loop. – Ben Graham Feb 04 '16 at 03:53
  • @BenGraham: In Python, not really. Generator expressions often beat a `for` loop, but this case looks likely to be pathological, because most of the data being used isn't captured in the closure. Python `for` loops are slow enough that the overhead for generators is actually lost in the noise most of the time. Of course, the real advantage to `sum` is when you can avoid explicit loops and genexprs completely; `total = sum(range(40))` blows `total = 0` `for i in range(40): total += i` out of the water (though it's obviously contrived example). – ShadowRanger Feb 04 '16 at 03:57
  • @ShadowRanger I had done some profiling to satisfy myself before posting that comment, but I now notice that python 2 (my system default) favours loops while python 3 favours generators. Guess that teaches me for doing off-the-cuff profiling. edit: I can think of a few Project Euler problems which can be solved by `sum(range(n))` :) – Ben Graham Feb 04 '16 at 04:03
  • @BenGraham: I'm pretty sure `itertools.accumulate` in 3.3 was created primarily to solve certain Project Euler problems (combined with `itertools.count`, it's a trivial way to make triangular, pentagonal and hexagonal number generators that run at the C layer for crazy speed). :-) – ShadowRanger Feb 04 '16 at 04:07
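For the record, the caching hack ShadowRanger describes in the comments does run and gives the same total; a minimal sketch (Python 3, variable names as in the question):

```python
Score = [[3, 4, 5, 6, 7, 8] for i in range(40)]
a = [0, 1, 2, 3, 4, 5, 4, 5, 2, 1, 3, 0, 5, 1, 0, 3, 4, 2, 2, 4,
     4, 5, 1, 2, 5, 4, 3, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 3, 2, 1]

# plain genexpr: Score and a are looked up in the global scope on every use
plain = sum(Score[j][a[j]] for j in range(40))

# caching hack: the extra `for Score, a in ((Score, a),)` clause runs once
# and rebinds both names as genexpr-local variables, so subsequent lookups
# are fast local lookups instead of global dict lookups
cached = sum(Score[j][a[j]] for Score, a in ((Score, a),) for j in range(40))

print(plain, cached)  # both 208
```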

2 Answers


This seems to depend on the system, and probably on the Python version. On my system, the difference is about 13%:

python sum.py 
208
208
('Version 1 time: ', 0.6371259689331055)
('Version 2 time: ', 0.7342419624328613)

The two versions are not actually measuring sum versus manual looping, because the loop bodies are not identical. ver2 does more work: it creates a new generator expression on each of the 100000 iterations, while ver1's loop body is almost trivial (although its range(40) call builds a fresh 40-element list on every iteration in Python 2). You can change the example so the per-iteration work is identical, and then you see the effect of sum:

def ver1():
    r = [Score[j][a[j]] for j in xrange(40)]
    for i in xrange(100000):
        total = 0
        for j in r:
            total+=j
    print (total)

def ver2():
    r = [Score[j][a[j]] for j in xrange(40)]
    for i in xrange(100000):
        total = sum(r)
    print (total)

I've moved everything out of the inner loop body and out of the sum call to make sure that we are measuring only the overhead of the hand-crafted loop. Using xrange instead of range further improves the overall runtime, but this applies to both versions and thus does not change the comparison. The results of the modified code on my system are:

python sum.py
208
208
('Version 1 time: ', 0.2034609317779541)
('Version 2 time: ', 0.04234910011291504)

ver2 is five times faster than ver1. This is the pure performance gain of using sum instead of a hand-crafted loop.

Inspired by ShadowRanger's comment on the question about lookups, I have modified the example to compare against the original code and check what the lookup of bound symbols costs:

def gen(s,b):
    for j in xrange(40):
        yield s[j][b[j]]

def ver2():
    for i in xrange(100000):
        total = sum(gen(Score, a))
    print (total)

I create a small custom generator which locally binds Score and a to prevent expensive lookups in parent scopes. Executing this:

python sum.py
208
208
('Version 1 time: ', 0.6167840957641602)
('Version 2 time: ', 0.6198039054870605)

The symbol lookups alone account for ~12% of the runtime.
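Besides the generator-function approach above, another common way to make those lookups local (not used in this answer; just a sketch of the standard default-argument trick, shown in Python 3 syntax) is:

```python
Score = [[3, 4, 5, 6, 7, 8] for i in range(40)]
a = [0, 1, 2, 3, 4, 5, 4, 5, 2, 1, 3, 0, 5, 1, 0, 3, 4, 2, 2, 4,
     4, 5, 1, 2, 5, 4, 3, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 3, 2, 1]

# Default arguments are evaluated once at definition time, so Score and a
# become local names inside the function; the genexpr then reads them from
# its closure instead of doing a global dict lookup on every access.
def ver2_local(Score=Score, a=a):
    return sum(Score[j][a[j]] for j in range(40))

print(ver2_local())  # 208
```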

Jens
  • I don't think this is a fair comparison. The point is to find the fastest way to calculate the sum, and by removing the intermediate list from the loop you have essentially precomputed part of the sum and only showed that using "sum" on a flat list is faster than iterating over it. – CaptainCodeman Feb 04 '16 at 11:10
  • @CaptainCodeman If you want to compare the speed of `sum` vs. a hand-written loop, you should take everything else out. Otherwise you are comparing apples and oranges. And I would strongly argue that your code for ver2 is just sub-optimal. – Jens Feb 04 '16 at 12:55
  • If you just want to sum a flat array, then sure, "sum" is faster, but that was never under question. This is testing it for a more complicated application that could arise. There is no way to flatten the array in practice. – CaptainCodeman Feb 04 '16 at 12:57

Since j is iterating over both lists, I thought I'd see if zip worked any better:

def ver3():
    for i in range(100000):
        total = sum(s[idx] for s, idx in zip(Score, a))
    print (total)

On Py2 this runs about 30% slower than version 2, but on Py3 about 20% faster than version 1. On Py2, changing `zip` to `izip` (imported from `itertools`) cuts the time down to between versions 1 and 2.

PaulMcG
  • Heck, if you want to get clever and remove all Python byte code execution from the equation during the `sum`: `from operator import getitem`, `total = sum(map(getitem, Score, a))` would probably do even better (on Py2, use `itertools.imap` to avoid intermediate `list`). – ShadowRanger Feb 04 '16 at 04:48
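ShadowRanger's suggestion runs as-is on Python 3, where `map` is already lazy; a sketch with the question's data:

```python
from operator import getitem

Score = [[3, 4, 5, 6, 7, 8] for i in range(40)]
a = [0, 1, 2, 3, 4, 5, 4, 5, 2, 1, 3, 0, 5, 1, 0, 3, 4, 2, 2, 4,
     4, 5, 1, 2, 5, 4, 3, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 3, 2, 1]

# map(getitem, Score, a) walks both lists in lockstep and performs the
# indexing Score[j][a[j]] at the C layer, so no per-element Python
# bytecode runs inside the sum
total = sum(map(getitem, Score, a))
print(total)  # 208
```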