
Questions are summarized here. Yes, I know some of these answers ;) and I can do some hand-waving on others, but I'd really like to get to the nitty-gritty here.

  1. Is this even a good idea at all? (This one isn't elaborated on below.)
  2. I wonder if the map actually adds speed improvements? Why?
  3. Why in the world would passing iterators to any make my code faster?
  4. Why did my Counter object work and my print_true function fail miserably?
  5. Is there an equivalent to itertools.imap that will just call a function over and over again, and optionally a certain number of times?
  6. Where is my carrot?!?

I just watched PyCon 2011: How Dropbox Did It and How Python Helped (admittedly I skipped through most of the parts), but FINALLY the really interesting stuff started at around 22:23.

The speaker advocated writing your inner loops in C and said that "run once" stuff doesn't need much optimization (makes sense)... then he went on to state, paraphrased:

Pass a composition of iterators to any for massive speed improvements.

Here is the code (hopefully it's identical):

import itertools, hashlib, time   
_md5 = hashlib.md5()  
def run():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)
a = time.time();  run(); time.time() - a  
Out[118]: 9.44077205657959

_md5 = hashlib.md5() 
def run():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))    
a = time.time();  run(); time.time() - a
Out[121]: 6.547091007232666

Hmm, looks like for even greater speed improvements I can just get a faster computer! (Judging from his slide.)

He then does a bunch of hand-waving without actually going into details as to why.

I already knew about the iterators from the answer to "pythonic way to do something N times without an index variable?", thanks to Alex Martelli.

Then I thought, I wonder if the map actually adds speed improvements? My final thought was WTF??? passing to any? REALLY??? Surely that can't be right since the documentation defines any as:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

Why in the world would passing iterators to any make my code faster?

I then tested it using the following (among many other tests) but this is what gets me:

def print_true(x):
    print 'True'
    return 'Awesome'

def test_for_loop_over_iter_map_over_iter_repeat():
    for result in itertools.imap(print_true, itertools.repeat("foo", 5)):
        pass

def run_any_over_iter_map_over_iter_repeat():
    any(itertools.imap(print_true, itertools.repeat("foo", 5)))

And the runs:

    In [67]: test_for_loop_over_iter_map_over_iter_repeat()
    True
    True
    True
    True
    True

    In [74]: run_any_over_iter_map_over_iter_repeat()
    True

For shame. I declared this GUY IS FULL OF IT. Heretic! But, I calmed down and continued to test. If this were true how in the hell could Dropbox even work!?!?

And with further testing it did work... I initially just used a simple counter object, and it counted all the way up to 10000000 in both cases.

So the question is why did my Counter object work and my print_true function fail miserably?

class Counter(object):
    count = 0
    def count_one(self, none):
        self.count += 1

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count

def run_for_counter():
    counter = Counter()
    for result in itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)):
        pass
    print counter.count

Output:

%time run_for_counter()
10000000
CPU times: user 5.54 s, sys: 0.03 s, total: 5.57 s
Wall time: 5.68 s

%time run_any_counter()
10000000
CPU times: user 5.28 s, sys: 0.02 s, total: 5.30 s
Wall time: 5.40 s

An even bigger WTF is that even after removing the unneeded argument and writing the most sensible code for my Counter object, it's STILL slower than the any-map version. Where is my carrot?!?

class CounterNoArg(object):
    count = 0
    def count_one(self):
        self.count += 1

def straight_count():
    counter = CounterNoArg()
    for _ in itertools.repeat(None, 10000000):
        counter.count_one()
    print counter.count

Output:

In [111]: %time straight_count()
10000000
CPU times: user 5.44 s, sys: 0.02 s, total: 5.46 s
Wall time: 5.60 s

I'm asking because I think the Pythonistas or Pythoneers need a carrot so we don't start passing stuff to any or all for a performance increase, or does one already exist? Possibly an equivalent to itertools.imap that will just call a function over and over again, and optionally a certain number of times.
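
Something like the following is roughly what I have in mind (a minimal sketch; repeat_call is a made-up name, and it is essentially the repeatfunc recipe from the itertools documentation):

import itertools

def repeat_call(func, times=None):
    # Lazily yield func() over and over; stop after 'times' calls if given.
    # repeat(()) supplies empty argument tuples for starmap to unpack.
    if times is None:
        return itertools.starmap(func, itertools.repeat(()))
    return itertools.starmap(func, itertools.repeat((), times))

Draining it with the deque trick would then call counter.count_one 10000000 times with no argument juggling: collections.deque(repeat_call(counter.count_one, 10000000), maxlen=0).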

The best I've managed are the following (using a list comprehension gives interesting results):

def super_run():
    counter = CounterNoArg()
    for _ in (call() for call in itertools.repeat(counter.count_one, 10000000)):
        pass
    print counter.count

def super_counter_run():
    counter = CounterNoArg()
    [call() for call in itertools.repeat(counter.count_one, 10000000)]
    print counter.count

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count

%time super_run()
10000000
CPU times: user 5.23 s, sys: 0.03 s, total: 5.26 s
Wall time: 5.43 s

%time super_counter_run()
10000000
CPU times: user 4.75 s, sys: 0.18 s, total: 4.94 s
Wall time: 5.80 s

%time run_any_counter()
10000000
CPU times: user 5.15 s, sys: 0.06 s, total: 5.21 s
Wall time: 5.30 s

def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

def super_run_like_presentation():
    [do_work for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 10000000))]

def super_run_like_presentation_2():
    [_md5.update(foo) for foo in itertools.repeat("foo", 10000000)]


%time run_any_like_presentation()
CPU times: user 5.28 s, sys: 0.02 s, total: 5.29 s
Wall time: 5.47 s

%time super_run_like_presentation()
CPU times: user 6.14 s, sys: 0.18 s, total: 6.33 s
Wall time: 7.56 s

%time super_run_like_presentation_2()
CPU times: user 8.44 s, sys: 0.22 s, total: 8.66 s
Wall time: 9.59 s

Ugh...

Note: I encourage you to run the tests yourself.

Derek Litz

  • You should make small questions instead of this monologue... SO is not a blog – JBernardo Feb 04 '12 at 22:05
  • Also, the `any` thing only works when the function returns a false value.... You can use `all` if it returns true values. On mixed values, you need something else – JBernardo Feb 04 '12 at 22:07
  • I think your comment is out of place. Showing how I've tried to answer this question is perfectly acceptable and may save others the time of trying to figure it out themselves. – Derek Litz Feb 04 '12 at 22:10
  • This one http://www.python.org/doc/essays/list2str would probably be useful. – Roman Bodnarchuk Feb 04 '12 at 22:12
  • I must respectfully disagree with @JBernardo. This is one of the few very long questions I've seen here that is actually good. It might be possible and better to make it a series of smaller interrelated and hyperlinked questions, but I don't think that is strictly necessary. – jscs Feb 04 '12 at 22:46

4 Answers

In your first example, the first version of run has to look up _md5.update each time round the loop, whereas the second version does not. I think you'll find that accounts for most of the performance difference. The rest is probably to do with having to set the local variable i, though that's not so easy to demonstrate.

import itertools, hashlib, timeit
_md5 = hashlib.md5()

def run1():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)

def run2():
    u = _md5.update
    for i in itertools.repeat("foo", 10000000):
        u(i)

def run3():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

>>> timeit.timeit('run1()', 'from __main__ import run1', number=1)
6.081272840499878
>>> timeit.timeit('run2()', 'from __main__ import run2', number=1)
4.660238981246948
>>> timeit.timeit('run3()', 'from __main__ import run3', number=1)
4.062871932983398

The itertools documentation has a better recipe for consuming an iterator (and discarding all its values): see the consume function. The use of any to do this job depends on the fact that _md5.update always returns None, so this approach doesn't work in general. Also, in my tests the recipe is very slightly faster (though see the comments below for contrary results on other machines):

import collections

def consume(it):
    "Consume iterator completely (discarding its values)."
    collections.deque(it, maxlen=0)

def run4():
    consume(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

>>> timeit.timeit('run4()', 'from __main__ import run4', number=1)
3.969902992248535

Edited to add: it seems that the consume recipe is not as well known as it should be. If you look at the details of the CPython implementation, you'll see that when collections.deque is called with maxlen=0 it calls the function consume_iterator in _collectionsmodule.c, which looks like this:

static PyObject*
consume_iterator(PyObject *it)
{
    PyObject *item;
    while ((item = PyIter_Next(it)) != NULL) {
        Py_DECREF(item);
    }
    Py_DECREF(it);
    if (PyErr_Occurred())
        return NULL;
    Py_RETURN_NONE;
}
Gareth Rees
  • I actually get run3 as faster than run4 when running this code (Python 2.7.2 on Win7 64bit). _run1 2.57853748827_, _run2 2.03119381833_, _run3 1.89985377945_, _run4 2.30778571169_ – Nisan.H Feb 04 '12 at 23:09
  • Intriguing! Try running with a higher `number` of repetitions and see what you get. On my setup (Python 2.7.2, Mac 64-bit) I find no significant difference between `run3` and `run4` after 10 repetitions (times within 1%). – Gareth Rees Feb 04 '12 at 23:21
  • I'm also running Mac 64-bit and don't see much difference between the two (nothing like Nisan.H). Sadly, it seems like the any version has a slight advantage every time I run it... negligible the longer it takes to process the iterator. If this is truly the case for consume perhaps this is why they chose to optimize using any... (so smaller iterators didn't start adding significant overhead over time). – Derek Litz Feb 05 '12 at 00:04
  • with `repeat("foo",100000000)` (1e+8 instead of 1e+7), and `number = 10`: `('run1', 8, 10, 255.6974894552196, 25.569748945521958) , ('run2', 8, 10, 203.28524775393402, 20.3285247753934) , ('run3', 8, 10, 192.10154148464142, 19.21015414846414) , ('run4', 8, 10, 233.2922627983511, 23.329226279835108) ,` Could this be architecture/memory specific? I'll try this again on my Linux box tonight. – Nisan.H Feb 05 '12 at 00:49
  • Same numbers, run on a core2 box running win7 64 and Ubuntu 8.10 64 (the previous box is a gen2 i7): `('run1', 8, 10, 326.78847574243053, 32.67884757424305), ('run2', 8, 10, 264.2153758744806, 26.421537587448064), ('run3', 8, 10, 249.48819697002807, 24.948819697002808), ('run4', 8, 10, 242.24297808842664, 24.224297808842664)`. It's interesting, seeing how the different CPU architectures behave with the different implementations. – Nisan.H Feb 05 '12 at 04:27
  • Thanks, Nisan—I have only a partial explanation for this difference. If you compare the implementations of [`consume_iterator`](http://hg.python.org/cpython/file/fd424ccc8cee/Modules/_collectionsmodule.c#l280) and [`builtin_any`](http://hg.python.org/cpython/file/fd424ccc8cee/Python/bltinmodule.c#l271) you'll see that the former calls `PyIter_Next` on each iteration, whereas the latter caches the `tp_iternext` function. Hard to imagine that this makes 15% difference on one architecture ... unless maybe there's a cache conflict? Any ideas? – Gareth Rees Feb 05 '12 at 13:50

To answer the first question, about optimizing by passing to any: no, I believe it is not a good idea, for the main reason that this is not any's intended purpose. Sure, it's easy to implement, but maintenance could become a nightmare. Doing this introduces a new gotcha into your code base: if the function ever returns a true value, the iterator will not be fully consumed, causing strange behavior and bugs that are hard to track down. Also, there exist faster (or at least nearly as fast) alternatives to the built-in any.

Of course, you can make an exception, because it seems any can actually outperform deque, but using any this way is certainly extreme and most often unnecessary. In fact, if anything, you may be introducing optimizations that may no longer be "optimal" after updates to the Python code base (see 2.7 vs 3.2).

Another thing worth mentioning is that this use of any doesn't immediately make sense to a reader. Whether to write a C extension rather than resort to any like this is also debatable; personally, I'd prefer the C extension, for semantic reasons.

As far as optimizing your own code goes, let's start with what we're up against: refer to run_any_like_presentation. It's pretty fast :)

An initial implementation could look something like:

import itertools, hashlib
_md5 = hashlib.md5()
def run():
    for _ in xrange(100000000):
        _md5.update("foo")

The first step is using itertools.repeat to do something N times.

def run_just_repeat():
    for foo in itertools.repeat("foo", 100000000):
        _md5.update(foo)

The second optimization is to use itertools.imap, which moves the call to _md5.update out of Python bytecode and into C, so the foo reference no longer has to be passed around in Python code.

def run_imap_and_repeat():
    for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 100000000)):
        pass

The third optimization is to move the for loop entirely into C code.

import collections
def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
            maxlen = 0)

The final optimization is to move all potential lookups into the namespace of the run function, by binding them as default arguments. This idea is taken from the very end of http://docs.python.org/library/itertools.html?highlight=itertools, which notes:

Note, many of the above recipes can be optimized by replacing global lookups with local variables defined as default values.

Personally, I've had mixed success with this trick: small improvements under certain conditions, and "from module import xxx" sometimes showing a performance increase without passing anything in as a default argument at all. Also, sometimes passing in some variables but not others makes a slight difference. The point is, I feel you're going to need to test this yourself to see if it works for you.

def run_deque_imap_and_repeat_all_local(deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    update = _md5.update  # hoist the attribute lookup out of the loop
    deque(imap(update, repeat("foo", 100000000)), maxlen = 0)

And finally, to be fair, let's implement an any version like the presentation's that does the final optimization as well.

def run_any_like_presentation_all_local(any = any, deque = collections.deque, 
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat, 
        md5 = hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))

OK, now let's run some tests (Python 2.7.2, OS X Snow Leopard, 64-bit):

  • run_reference - 123.913 seconds

  • run_deque_imap_and_repeat_all_local - 51.201 seconds

  • run_deque_local_imap_and_repeat - 53.013 seconds

  • run_deque_imap_and_repeat - 48.913 seconds

  • run_any_like_presentation - 49.833 seconds

  • run_any_like_presentation_all_local - 47.780 seconds

And just for kicks, in Python 3 (Python 3.2, OS X Snow Leopard, 64-bit):

  • run_reference - 94.273 seconds (100000004 function calls!)

  • run_deque_imap_and_repeat_all_local - 23.929 seconds

  • run_deque_local_imap_and_repeat - 23.298 seconds

  • run_deque_imap_and_repeat - 24.201 seconds

  • run_any_like_presentation - 24.026 seconds

  • run_any_like_presentation_all_local - 25.316 seconds

Here's my source for the tests:

import itertools, hashlib, collections
_md5 = hashlib.md5()

def run_reference():
    for _ in xrange(100000000):
        _md5.update("foo")

def run_deque_imap_and_repeat_all_local(deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen = 0)

def run_deque_local_imap_and_repeat(deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen = 0)

def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
            maxlen = 0)

def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)))

def run_any_like_presentation_all_local(any = any, deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')

And the Python 3 version:

import itertools, hashlib, collections
_md5 = hashlib.md5()

def run_reference(foo = "foo".encode('utf-8')):
    for _ in range(100000000):
        _md5.update(foo)

def run_deque_imap_and_repeat_all_local(deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen = 0)

def run_deque_local_imap_and_repeat(deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen = 0)

def run_deque_imap_and_repeat():
    collections.deque(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)),
            maxlen = 0)

def run_any_like_presentation():
    any(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)))

def run_any_like_presentation_all_local(any = any, deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat):
    any(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')

One more thing: don't do this on a real project unless there is a certifiable performance bottleneck.

And finally, if we really need a carrot (aside from writing code that makes sense and isn't prone to error): in those hard times when any actually outperforms deque, your more sensible code will be in a better position to take advantage of improvements in newer versions of Python without any modification to your code base.

http://www.python.org/doc/essays/list2str/ is a good read on how to approach Python optimization (i.e., ideally, writing a C extension is NOT the first thing you reach for).

I'd also like to point out Gareth's answer, as he may be on to why any can outperform deque.

Derek Litz

The run_any_counter function has no explicit return value and thus returns None, which is false in a boolean context, so any consumes the whole iterable.

A more general way to consume iterables is given in the recipes section for itertools. It doesn't rely on a false return value.
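
For reference, the recipe is roughly the following (paraphrased from the itertools documentation; it advances an iterator n steps, or consumes it entirely when n is None):

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        # Feed the entire iterator into a zero-length deque.
        collections.deque(iterator, maxlen=0)
    else:
        # Advance to the empty slice starting at position n.
        next(islice(iterator, n, n), None)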

Comparing run_any_like_presentation and friends: imap(f, seq) looks up f only once, while the list comprehension [f(x) for x in seq] looks it up for each element of seq. [x for x in imap(f, seq)] is a funny way of spelling list(imap(f, seq)), but both build an unnecessary list.

Finally, the for loop assigns to the loop variable, even if it is not used, so that's slightly slower.
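
A rough way to see the lookup difference (a sketch, not a rigorous benchmark; absolute numbers will vary by machine):

import timeit

setup = "from itertools import imap; f = abs; seq = range(100000)"

# f is looked up once here...
print timeit.timeit("for _ in imap(f, seq): pass", setup, number=100)
# ...and once per element here, and a throwaway list gets built too.
print timeit.timeit("[f(x) for x in seq]", setup, number=100)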

Reinstate Monica
  • Does Python actually build the list when I'm not assigning it to anything? Do you have a reference for that point? If you're right then Python allocates the memory for the list and then immediately releases it, due to reference counting. Is that what's happening? – Derek Litz Feb 04 '12 at 22:35
  • The docs for list displays say so ( http://docs.python.org/reference/expressions.html#list-displays ). But I guess that the list creation could actually be skipped. (Probably slightly different semantics: if no reference to the generated elements is kept, they may be garbage collected while the list is being built.) – Reinstate Monica Feb 04 '12 at 22:39
  • I think it'd be a good idea to mention that `any` is written in C which contributes much of the speed advantage. – Winston Ewert Feb 04 '12 at 22:44
  • @WinstonEwert I took that for granted, but I guess you are right. – Reinstate Monica Feb 04 '12 at 22:46

He then does a bunch of hand waving without actually going into details as to why.

Because the actual looping is done natively instead of by interpreting Python bytecodes.
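
For a rough sense of what "interpreting bytecodes" means, dis shows the opcodes the interpreter executes on every pass through an explicit Python loop (a sketch; the exact opcodes vary by Python version):

import dis

def explicit_loop(seq, f):
    for x in seq:
        f(x)

# FOR_ITER, STORE_FAST, LOAD_FAST, CALL_FUNCTION, POP_TOP... all of these
# run in the interpreter on every single iteration of the Python loop.
dis.dis(explicit_loop)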

Why did my Counter object work and my print_true function fail miserably?

any stops as soon as it finds a true-ish return value, because it knows that the "any" condition is satisfied (short-circuit evaluation).

print_true returns "Awesome", which is true-ish. counter.count_one doesn't have an explicit return, so it returns None, which is false-ish.
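
A quick illustration of the short-circuiting (noisy is a made-up helper for demonstration):

import itertools

def noisy(x):
    print x
    return x

# any() pulls values until the first truthy one, then stops:
# this prints 0, 0, 3 and returns True; the final 0 is never processed.
any(itertools.imap(noisy, [0, 0, 3, 0]))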

Karl Knechtel