5

I want to find the fastest Python equivalent of C's switch statement. I'm rewriting some C code in Python, and it's all working fine except for one bottleneck. This code runs in a tight loop, so getting the best performance really is crucial.

Optimisation Attempt 1: Previous questions such as this one suggest using hash tables for lookups. This ended up being incredibly slow.

Optimisation Attempt 2: Another optimisation I made was to create a run of if ... return statements, which gave me a 13% speed boost. It's still disappointingly slow.

Optimisation Attempt 3: I created an array.array indexed by every possible input value and did a plain index lookup. This gives an overall speed-up of 43%, which is respectable.

I'm running over an array.array using map, passing a transform function that does the lookup. My switch operates on short integers (it's a typed array). If this were GCC C, the compiler would create a jump table. It's frustrating to know that Python is either hashing my value to find a table entry or, in the case of the if chain, performing lots of comparisons. Profiling shows that the slow functions are precisely the ones doing the lookup.
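For concreteness, a minimal sketch of what Attempt 3 looks like (the table contents and the doubling transform here are placeholders, not the actual mapping):

```python
import array

# Hypothetical lookup table: one slot per possible short-integer input,
# holding the value the C switch would have returned.
TABLE = array.array('h', [v * 2 for v in range(256)])

def transform(val, table=TABLE):
    # A plain index lookup stands in for the C switch statement.
    return table[val]

data = array.array('h', [1, 5, 200])
result = array.array('h', map(transform, data))
```

The default-argument trick (`table=TABLE`) makes the table a local name inside the function, which avoids a global lookup on every call.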

What is the absolute fastest way of mapping one integer to another, applied over an array.array if relevant? Is anything faster than the above?

EDIT

Although it makes me look like an idiot for only just realising it, I'll say it anyway: remember that running your code under a profiler slows it down a lot. In my case, 19 times slower. Suddenly my bottleneck isn't so bad! Thanks very much everyone for all your answers. The question is still valid, and I'll leave it open for a bit because there may be some interesting answers.

With profiler, for my test set of data:

real    0m37.309s
user    0m33.263s
sys     0m4.002s

without:

real    0m2.595s
user    0m2.526s
sys     0m0.028s
Joe
    If you're worried about such minor details, Python performance won't do you any good anyhow.. use numpy? – Voo Dec 04 '11 at 17:36
  • That's the frustrating thing. I'm re-writing some C code into Python. Nearly all of it is perfect, it's just this one bottleneck which is holding it back. – Joe Dec 04 '11 at 17:51
  • I agree with @Voo that if you are worried about this kind of performance, probably a python implementation is the wrong way to go. Beside using NumPy, you could also consider turning your existing C code in a C extension to Python. It should be a trivial modification! – mac Dec 04 '11 at 17:55
  • FYI, there are tools to help do what mac is suggesting as well, for example, http://cython.org/ – Derek Litz Dec 04 '11 at 18:00
  • Can you do something as simple as define a list with the desired output at the position of its input? How many cases do you need to cover? – mtrw Dec 04 '11 at 18:09
  • @mac indeed I am re-writing a Python extension! I could put bits back, but it would be a shame not to banish all the C code. – Joe Dec 04 '11 at 18:10
  • @mtrw that was attempt 3, and yes that is the fastest so far – Joe Dec 04 '11 at 18:12
  • @Joe I'd seriously consider Cython then. It will require setting up your build and some additional framework, but will let you write the module in python while taking advantage of faster speeds. – David H. Clements Dec 04 '11 at 18:29
  • Performance is now acceptable (see my edit). – Joe Dec 04 '11 at 18:50

2 Answers


I think others are right to suggest numpy or pure C; but for pure Python, here are some timings, for what they're worth. Based on these, I'm a bit surprised that array.array performed so much better than a dict for you. Are you creating these tables on the fly inside the loop? Or have I misunderstood something else about your question? In any case, this suggests that a list is actually the best way to go.

>>> import array
>>> def make_lookup_func(table):
...     def lookup(val, t=table):
...         return t[val]
...     return lookup
... 
>>> lookup_tuple = make_lookup_func(tuple(range(10)))
>>> lookup_list = make_lookup_func(list(range(10)))
>>> lookup_array = make_lookup_func(array.array('i', range(10)))
>>> lookup_dict = make_lookup_func(dict(zip(range(10), range(10))))
>>> %timeit lookup_tuple(9)
10000000 loops, best of 3: 177 ns per loop
>>> %timeit lookup_list(9)
10000000 loops, best of 3: 158 ns per loop
>>> %timeit lookup_array(9)
10000000 loops, best of 3: 181 ns per loop
>>> %timeit lookup_dict(9)
10000000 loops, best of 3: 166 ns per loop

Scaling behavior:

>>> lookup_tuple = make_lookup_func(tuple(range(10000)))
>>> lookup_list = make_lookup_func(list(range(10000)))
>>> lookup_array = make_lookup_func(array.array('i', range(10000)))
>>> lookup_dict = make_lookup_func(dict(zip(range(10000), range(10000))))
>>> %timeit lookup_tuple(9000)
10000000 loops, best of 3: 177 ns per loop
>>> %timeit lookup_list(9000)
10000000 loops, best of 3: 158 ns per loop
>>> %timeit lookup_array(9000)
10000000 loops, best of 3: 186 ns per loop
>>> %timeit lookup_dict(9000)
10000000 loops, best of 3: 195 ns per loop
senderle
  • BTW, I also tried `array.array` with typecode `'h'` and it gave exactly the same results. – senderle Dec 04 '11 at 18:19
  • I actually created the inputs as hash tables for readability, then created an `array.array` from these as a one-time operation. I somehow assumed that a lookup would be quicker in an array (I imagine it would use pointer arithmetic vs something slower with PyObjects for a list). It's just a guess, I'll try it with a list too. – Joe Dec 04 '11 at 18:21
  • @Joe, in fact my understanding is that in cPython, `list`s use pointer arithmetic under the hood. (But I could be misremembering; it's been a while since I looked at the source code.) – senderle Dec 04 '11 at 18:31

Branch logic in general can be painfully slow in Python in this type of application, and you have basically struck on one of the better approaches for a tight inner loop that converts between integers. A few more things to experiment with:

One option is working with np.array, or using Cython (or just straight C) for the tight loop. These require some additional setup (and possibly writing the inner loop in C), but can also give tremendous speedups for this type of application and let you take advantage of a good C optimizer.
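With NumPy the whole map can become a single fancy-indexing operation, pushing the loop into C. A hedged sketch (placeholder table and inputs; this assumes inputs are non-negative and the table covers the full input range):

```python
import numpy as np

# Hypothetical lookup table: index = input value, entry = output value.
table = np.arange(256, dtype=np.int16) * 2
inputs = np.array([1, 5, 200], dtype=np.int16)

# One vectorized lookup replaces a per-element Python function call.
outputs = table[inputs]
```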

Something that can go either way, and is more of a micro-optimization: you could try using a list comprehension instead of map, or make sure you aren't passing a lambda to your map. Avoiding a lambda in map() is actually a pretty big win, while the difference between a list comprehension and map tends to be relatively small otherwise.
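An illustrative sketch of the lambda point (table and data are placeholders):

```python
table = [v * 2 for v in range(256)]  # hypothetical lookup table
data = [1, 5, 200]

# Slower: the lambda adds an extra Python-level call per element.
out1 = list(map(lambda v: table[v], data))

# Faster: pass a named callable directly; table.__getitem__ skips the
# extra function frame entirely.
out2 = list(map(table.__getitem__, data))

# A list comprehension is usually comparable to map with a named function.
out3 = [table[v] for v in data]
```

All three produce identical results; only the per-element call overhead differs.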

David H. Clements