3

Starting from here (a branch prediction problem), I wrote a Python version of the program to compare the runtimes of the sorted and unsorted versions. I tried the sorted version first.

Here's the code:

import time
from random import randint

arraysize = 327
data = []

for c in range(arraysize):
    data.append(randint(1, 256))

## !!! with this, the next loop runs faster
data.sort()

## test
start_time = time.clock()

sum = 0  # note: this shadows the builtin sum()

for i in range(100000):
    for c in range(arraysize):
        if data[c] >= 128:
            sum += data[c]

print time.clock() - start_time

I'm not sure about the accuracy of my simple timing methodology, but it seems good enough. When I set arraysize = 32768, I waited more than 20 minutes the first time! But with arraysize = 327, I get a time of 8.141656691 s.

Please correct me if I'm wrong somewhere in my code, or tell me whether using NumPy/SciPy in some way would speed things up. Thanks.
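For what it's worth, I imagine the NumPy version would look something like the sketch below, with the comparison applied elementwise (so sorting shouldn't matter for this approach), but I haven't profiled it:

import numpy as np

arraysize = 327
data = np.random.randint(1, 257, size=arraysize)  # NumPy's upper bound is exclusive

# vectorized filter-and-sum: the loop runs in C, not in Python bytecode
total = data[data >= 128].sum()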

bad_keypoints
  • For timing comparisons, use the [`timeit` module](http://docs.python.org/library/timeit.html); it makes the right choice of timer for you regardless of platform. – Martijn Pieters Sep 21 '12 at 12:46
  • Also, the pythonic method to compute your sum, is to use the `sum()` function with a list comprehension (e.g. `sum(c for c in data if c >= 128)`). – Martijn Pieters Sep 21 '12 at 12:47
  • @MartijnPieters -- That's not a list comprehension, that's a generator expression ;-) -- but of course, you knew that already. – mgilson Sep 21 '12 at 12:49
  • Also, I doubt that the sorting will make a difference; the python bytecode evaluation loop does not exert much effort in branch prediction. – Martijn Pieters Sep 21 '12 at 13:01
  • sir, did you try it out yourself? how much time is the script taking? 8s for arraysize=327 only is buggin' the hell out of me – bad_keypoints Sep 21 '12 at 13:03
  • 327 * 100000 = 32 700 000 Iterations, and that in an interpreter, roughly 4 000 000 items per second, thats not that bad, now is it? – ted Sep 21 '12 at 13:07
  • haha. well, if you put it like that. i'm embarrassed :P – bad_keypoints Sep 21 '12 at 13:08
  • and with mr mgilson's advice and changes to my code, i'm now getting 2.44s. Better than letting the program run for >20m (but then i didn't have mr ted's perspective :) ) – bad_keypoints Sep 21 '12 at 13:15

2 Answers

9

I started with the answer by @mgilson and reworked it a bit. I wanted to test the "decision bit" and lookup table techniques as discussed in my answer to the original question: https://stackoverflow.com/a/17782979/166949

I made a few changes to the original. Some were just style things that reflect my personal preference, but I also discovered genuine bugs, and I think it's important to measure correct code.

I changed the Python code to take arguments from the command line, and wrote a shell script that runs the Python script under Python 2.x, Python 3.x, and PyPy. The exact versions were: Python 2.7.6, Python 3.4.0, and PyPy 2.7.3.

I ran the tests on Linux Mint 17, 64-bit version. The CPU is an AMD Phenom 9850 running at 2.5 GHz with 16 GB of RAM. The Linux kernel version, per uname -r, is: 3.13.0-24-generic

The reason I made the code take arguments from the command line is that 327 is a pretty short list. I figured that sum() and the generator expressions would do better when the list was much longer, so I made the list size and number of trials command-line arguments. The results show which Python was used, plus the list length and number of trials.

Conclusion: to my surprise, even with a long list, Python was fastest using sum() with a list comprehension! There is some overhead to running a generator, and it seems to cost more than the overhead of building a list and then tearing it down.

However, I expected that if the list got truly large, the generator would begin to outperform the list comprehension. With a list of a million random values the listcomp was still faster, but when I went to 16 million random values the genexp became faster. And the speed penalty of the generator expression is not large for shorter lists, so I still favor the generator expression as the go-to way to solve this problem in Python.

Interestingly, PyPy was fastest with the table lookup. This makes sense: that was the fastest way I found in C, and PyPy is generating native code from the JIT.

For CPython, with its bytecode virtual machine, it is faster to invoke a single operation than several operations: the VM's per-operation dispatch overhead can outweigh the cost of a more expensive fundamental operation. Thus integer division is faster than bit masking plus bit shifting, because the division is a single operation. In PyPy, by contrast, bit masking + shifting is much faster than the division.

Also, in CPython, using sum() lets your code run in the C internals, so it can be very fast; but in PyPy, sum() is slower than just writing a straightforward loop that the JIT can turn into a wicked fast native loop. My guess is that the generator machinery is hard for PyPy to grok and optimize away into native code.

The shell script:

#!/bin/sh
# Run the benchmark under each interpreter.
# $1 = list length, $2 = number of trials
for P in python python3 pypy; do
    echo "$P ($1, $2)"
    $P test_branches.py $1 $2
    echo
done
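The script takes the list length and trial count as its two arguments; for example, the first results block below corresponds to an invocation like this (I'm calling the script run_tests.sh here; any name works):

sh run_tests.sh 327 100000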

The Python code:

import random
import sys
import timeit

try:
    RANGE = xrange
except NameError:
    RANGE = range

if len(sys.argv) != 3:
    print("Usage: python test_branches.py <length_of_array> <number_of_trials>")
    sys.exit(1)

TEST_DATA_LEN = int(sys.argv[1])
NUM_REPEATS = int(sys.argv[2])

_test_data = [random.randint(0, 255) for _ in RANGE(TEST_DATA_LEN)]

def test0(data):
    """original way"""
    total = 0
    for i in RANGE(TEST_DATA_LEN):
        if data[i] >= 128:
            total += data[i]
    return total


def test1(data):
    """better loop"""
    total = 0
    for n in data:
        if n >= 128:
            total += n
    return total

def test2(data):
    """sum + generator"""
    return sum(n for n in data if n >= 128)

def test3(data):
    """sum + listcomp"""
    return sum([n for n in data if n >= 128])

def test4(data):
    """decision bit -- bit mask and shift"""
    lst = [0, 0]
    for n in data:
        lst[(n & 0x80) >> 7] += n
    return lst[1]

def test5(data):
    """decision bit -- division"""
    lst = [0, 0]
    for n in data:
        lst[n // 128] += n
    return lst[1]

_lut = [0 if n < 128 else n for n in RANGE(256)]

def test6(data):
    """lookup table"""
    total = 0
    for n in data:
        total += _lut[n]
    return total

def test7(data):
    """lookup table with sum()"""
    return sum(_lut[n] for n in data)

test_functions = [v for k, v in globals().items() if k.startswith("test")]
test_functions.sort(key=lambda x: x.__name__)

correct = test0(_test_data)

for fn in test_functions:
    name = fn.__name__
    doc = fn.__doc__
    if fn(_test_data) != correct:
        print("{}() not correct!!!".format(name))
    s_call = "{}(_test_data)".format(name)
    s_import = "from __main__ import {},_test_data".format(name)
    t = timeit.timeit(s_call, s_import, number=NUM_REPEATS)
    print("{:7.03f}: {}".format(t, doc))

The results:

python (327, 100000)
  3.170: original way
  2.211: better loop
  2.378: sum + generator
  2.188: sum + listcomp
  5.321: decision bit -- bit mask and shift
  4.977: decision bit -- division
  2.937: lookup table
  3.464: lookup table with sum()

python3 (327, 100000)
  5.786: original way
  3.444: better loop
  3.286: sum + generator
  2.968: sum + listcomp
  8.858: decision bit -- bit mask and shift
  7.056: decision bit -- division
  4.640: lookup table
  4.783: lookup table with sum()

pypy (327, 100000)
  0.296: original way
  0.312: better loop
  1.932: sum + generator
  1.011: sum + listcomp
  0.172: decision bit -- bit mask and shift
  0.613: decision bit -- division
  0.140: lookup table
  1.977: lookup table with sum()


python (65536, 1000)
  6.528: original way
  4.661: better loop
  4.974: sum + generator
  4.150: sum + listcomp
 10.971: decision bit -- bit mask and shift
 10.218: decision bit -- division
  6.052: lookup table
  7.070: lookup table with sum()

python3 (65536, 1000)
 12.999: original way
  7.618: better loop
  6.826: sum + generator
  5.587: sum + listcomp
 19.326: decision bit -- bit mask and shift
 14.917: decision bit -- division
  9.779: lookup table
  9.575: lookup table with sum()

pypy (65536, 1000)
  0.681: original way
  0.884: better loop
  2.640: sum + generator
  2.642: sum + listcomp
  0.316: decision bit -- bit mask and shift
  1.573: decision bit -- division
  0.280: lookup table
  1.561: lookup table with sum()


python (1048576, 100)
 10.371: original way
  7.065: better loop
  7.910: sum + generator
  6.579: sum + listcomp
 17.583: decision bit -- bit mask and shift
 15.426: decision bit -- division
  9.285: lookup table
 10.850: lookup table with sum()

python3 (1048576, 100)
 20.435: original way
 11.221: better loop
 10.162: sum + generator
  8.981: sum + listcomp
 29.108: decision bit -- bit mask and shift
 23.626: decision bit -- division
 14.706: lookup table
 14.173: lookup table with sum()

pypy (1048576, 100)
  0.985: original way
  0.926: better loop
  5.462: sum + generator
  6.623: sum + listcomp
  0.527: decision bit -- bit mask and shift
  2.334: decision bit -- division
  0.481: lookup table
  5.800: lookup table with sum()


python (16777216, 10)
 15.704: original way
 11.331: better loop
 11.995: sum + generator
 13.787: sum + listcomp
 28.527: decision bit -- bit mask and shift
 25.204: decision bit -- division
 15.349: lookup table
 17.607: lookup table with sum()

python3 (16777216, 10)
 32.822: original way
 18.530: better loop
 16.339: sum + generator
 14.999: sum + listcomp
 47.755: decision bit -- bit mask and shift
 38.930: decision bit -- division
 23.704: lookup table
 22.693: lookup table with sum()

pypy (16777216, 10)
  1.559: original way
  2.234: better loop
  6.586: sum + generator
 10.931: sum + listcomp
  0.817: decision bit -- bit mask and shift
  3.714: decision bit -- division
  0.752: lookup table
  3.837: lookup table with sum()
steveha
  • You probably ought to mention your OS if you're going to be this specific as that can make a big difference too. – mgilson Jun 25 '14 at 03:36
  • Okay, I added full info on OS and also on the hardware used to run the tests. – steveha Jun 25 '14 at 03:57
  • Looks good to me. +1 for the attention to detail (and for clearly besting my answer ;-) – mgilson Jun 25 '14 at 04:00
  • I don't view this as a competition! And your answer paved the road for mine; if my answer has advanced a little farther, it is because it has stood on the shoulders of a giant (your answer). The key difference was that I added a test that each function returned the same value; where possible, make your code test *itself* and it will find the bugs for you. Also, testing under Python 3.x would have revealed the bug (since Python 3.x doesn't define the operation of comparing an integer to a list). Thanks for the +1. – steveha Jun 25 '14 at 04:05
4

One small optimization, which is also a matter of style: lists can be iterated over directly instead of being indexed:

for d in data:
    if d >= 128:
        sum += d

This saves a few function calls.
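If you're curious where those savings come from, the standard dis module can disassemble both loops. Here's a quick illustrative sketch (exact opcode names vary across CPython versions): the indexed version pays for the range() bookkeeping plus a separate subscript operation every time data[c] appears.

import dis

def indexed_loop(data):
    s = 0
    for c in range(len(data)):
        if data[c] >= 128:  # each data[c] is a separate subscript operation
            s += data[c]
    return s

def direct_loop(data):
    s = 0
    for d in data:  # the element is already bound to d; no subscripting
        if d >= 128:
            s += d
    return s

dis.dis(indexed_loop)  # note the repeated subscript (BINARY_SUBSCR) opcodes
dis.dis(direct_loop)   # no subscript opcodes in the loop body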

This loop might also go faster if you use the builtin sum() function:

my_sum += sum(d for d in data if d >= 128)

A list comp may be faster than the above generator (at the expense of a little extra memory):

my_sum += sum([d for d in data if d >= 128])

Of course, from an algorithmic perspective, we can get rid of the outermost loop entirely, since the sum computed by the inner loop isn't going to change:

my_sum = 100000 * sum(d for d in data if d >= 128)

but I'm guessing you already knew that...


Finally, here's how I would benchmark this:

import random
import timeit

N = 327
testarr = [random.randint(1, 256) for _ in range(N)]

def test1(data):
    """Original way"""
    s = 0
    for c in range(N):
        if data[c] >= 128:
            s += data[c]


def test2(data):
    """better loop"""
    s = 0
    for d in data:
        if d >= 128:
            s += d

def test3(data):
    """sum + generator"""
    sum(d for d in data if d >= 128)

def test4(data):
    """sum + listcomp"""
    sum([d for d in data if d >= 128])

NNUMBER = 100000
print timeit.timeit('test1(testarr)', 'from __main__ import test1,testarr', number=NNUMBER)
print timeit.timeit('test2(testarr)', 'from __main__ import test2,testarr', number=NNUMBER)
print timeit.timeit('test3(testarr)', 'from __main__ import test3,testarr', number=NNUMBER)
print timeit.timeit('test4(testarr)', 'from __main__ import test4,testarr', number=NNUMBER)

My results (OS X Mavericks, Python 2.7.5; mileage may vary):

2.80526804924  # loop with range
2.04129099846  # loop over elements
1.91441488266  # sum with generator
2.05234098434  # sum with list

For me, the generator with sum wins by a small margin. The list-comp with sum and the explicit loop are about equal. Looping over indices is (not surprisingly) the slowest.

mgilson
  • did you try with `arraysize = 32768` without ridding the outer loop? how much time did you get? – bad_keypoints Sep 21 '12 at 12:56
  • is this what you're talking about? `for i in range(100000): for c in range(arraysize): my_sum += sum( d for d in data if d>=128 )` – bad_keypoints Sep 21 '12 at 13:05
  • the outermost loop is for timing; read the corresponding post. There are some insights that the Intel compiler can do something pretty similar to what you suppose, as an automatic optimization – ted Sep 21 '12 at 13:05
  • @mgilson oh no that's wrong. i have to remove the inner loop. sorry! – bad_keypoints Sep 21 '12 at 13:06
  • i'm getting now ~2.4s (for arraysize=327) thanks. that's still a great improvement on my code. – bad_keypoints Sep 21 '12 at 13:12
  • In `test3()` and `test4()` there is a bug: the code compares `data` to 128 rather than comparing each value. (`data` is the whole list of random numbers.) I fixed the bug in my rewrite; see my answer. – steveha Jun 25 '14 at 03:21
  • It's interesting to me that `sum()` with a generator wins in your test, while in my tests a list comprehension was usually faster. I guess this is a Mac vs. Linux difference. I wonder if listcomps are unusually good on Linux, or if generators are unusually good on Macs, or what! Overall, I think the conventional wisdom is vindicated: use `sum()` plus a generator expression to solve this problem on Python. Unless you know you will be running on PyPy, in which case write a simple `for` loop that the JIT can turn to a native code loop. – steveha Jun 25 '14 at 04:00