
I am curious about the fastest way to consume an iterator, and the most Pythonic way to do it.

For example, say that I want to create an iterator with the map builtin that accumulates something as a side-effect. I don't actually care about the result of the map, just the side effect, so I want to blow through the iteration with as little overhead or boilerplate as possible. Something like:

my_set = set()
my_map = map(lambda x, y: my_set.add((x, y)), my_x, my_y)

In this example, I just want to blow through the iterator to accumulate things in my_set; my_set stays empty until I actually run through my_map. Something like:

for _ in my_map:
    pass

or a naked

[_ for _ in my_map]

works, but they both feel clunky. Is there a more Pythonic way to make sure an iterator iterates quickly so that you can benefit from some side-effect?


Benchmark

I tested the two methods above on the following:

my_x = np.random.randint(100, size=int(1e6))
my_y = np.random.randint(100, size=int(1e6))

with my_set and my_map as defined above. I got the following results with timeit:

for _ in my_map:
    pass
468 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[_ for _ in my_map]
476 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No real difference between the two, and they both feel clunky.

Note, I got similar performance with list(my_map), which was a suggestion in the comments.
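For completeness, a self-contained version of this benchmark might look like the following (a sketch: it assumes NumPy is available and uses timeit.timeit instead of the %timeit magic, so absolute numbers will differ):

import timeit
import numpy as np

my_x = np.random.randint(100, size=int(1e6))
my_y = np.random.randint(100, size=int(1e6))

def for_loop_consume():
    my_set = set()
    my_map = map(lambda x, y: my_set.add((x, y)), my_x, my_y)
    for _ in my_map:  # iterate purely for the side effect on my_set
        pass

print(timeit.timeit(for_loop_consume, number=1))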

Engineero
  • Instead of `[_ for _ in my_map]` you can also just do `list(my_map)` and discard the result – Matias Cicero Jun 19 '18 at 23:01
  • I'm pretty sure this is a dup of an old question, and I'm pretty sure I referred to that old question in an answer of my own… but I can't find it. (I can find a few answers like [this one](https://stackoverflow.com/questions/15843883/does-this-benchmark-seem-relevant/15844134#15844134), but they all just flat-out state that `deque` is fastest, without a link…). – abarnert Jun 19 '18 at 23:06
  • You can also use `[0 for _ in my_map if False]` instead of `[_ for _ in my_map]` if you want to reduce memory allocation. It should also be faster. – Carlos Pinzón Nov 24 '20 at 15:23
  • I wouldn't use either `[_ for _ in my_map]` or `list(my_map)`, because even if only temporarily, they build a list; if the number of items in `my_map` is large, this may cause memory problems. So if there is really no value to be gathered, I consider creating an intermediary list the wrong thing to do. I don't like it, but I think the plain, ugly, explicit for loop (`for _ in my_map: pass`) is the best way to go. – Jan Spurny Jun 28 '22 at 19:29

3 Answers


While you shouldn't be creating a map object just for side effects, there is in fact a standard recipe for consuming iterators in the itertools docs:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

For just the "consume entirely" case, this can be simplified to

def consume(iterator):
    collections.deque(iterator, maxlen=0)

Using collections.deque this way avoids storing all the elements (because maxlen=0) and iterates at C speed, without bytecode interpretation overhead. There's even a dedicated fast path in the deque implementation for using a maxlen=0 deque to consume an iterator.
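Applied to the example from the question, the whole pattern becomes a one-liner (a sketch; my_set, my_x, and my_y as defined in the question):

my_set = set()
consume(map(lambda x, y: my_set.add((x, y)), my_x, my_y))
# my_set is now fully populated; no list of None results was ever built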

Timing:

In [1]: import collections

In [2]: x = range(1000)

In [3]: %%timeit
   ...: i = iter(x)
   ...: for _ in i:
   ...:     pass
   ...: 
16.5 µs ± 829 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: %%timeit
   ...: i = iter(x)
   ...: collections.deque(i, maxlen=0)
   ...: 
12 µs ± 566 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Of course, this is all based on CPython. The entire nature of interpreter overhead is very different on other Python implementations, and the maxlen=0 fast path is specific to CPython. See abarnert's answer for other Python implementations.

user2357112
  • That's great, thanks! And I should have added the caveat that this is, in general, a bad idea, but believe it or not I found a case where I need to do it! – Engineero Jun 19 '18 at 23:00
  • @zvone: It's faster. Timings added. – user2357112 Jun 19 '18 at 23:12
  • @Engineero Such cases aren't super-common, but they do come up—otherwise, `itertools` wouldn't have a recipe for it in the docs, and `deque` wouldn't have an optimization for it in the C source. – abarnert Jun 19 '18 at 23:13
  • It's worth noting that in PyPy, `deque` is not optimized for maxlen=0, so it's actually more than an order of magnitude _slower_ than a `for` loop (23.3us vs. 1.17us on my laptop). – abarnert Jun 19 '18 at 23:17
  • I find your timings for the different consumptions misleading. You are actually mostly measuring how fast `range` produces the `int` objects. I changed `x = range(1000)` to `x = tuple(range(1000))`, which reduced the `deque`'s "consumption" time from 17 µs to 3 µs! (And the time for the loop from 23 µs down to 9 µs.) – Stefan Pochmann Mar 28 '22 at 08:43
  • @StefanPochmann: Perhaps. The timing was intended to demonstrate that `deque` is faster, and it demonstrates that, but it may be misconstrued as showing an absolute performance ratio. – user2357112 Mar 28 '22 at 10:19
  • That said, both this timing and the `x = tuple(range(1000))` timing have another flaw - they give a misleading impression of how much this overhead actually matters. Both range iterators and tuple iterators are far faster than anything that would be passed to `consume` in a practical use case. In a practical use case, the absolute performance ratio between `for` and `deque` matters even less than the answer currently suggests, because the iterator itself will dominate the runtime even more than the range iterator does. – user2357112 Mar 28 '22 at 10:24
  • I'll add some more timings later to give a more comprehensive view, but I don't have the time right now. – user2357112 Mar 28 '22 at 10:29
  • @user2357112supportsMonica Yes, for showing that it's faster it's enough. And we can't *completely* remove the time contribution from the iterable (well, maybe by profiling the C code). And I mostly agree about these overheads mattering less in most practical use cases. Still, I always strive to minimize the benchmark overhead and ideally show just the time for the thing I *want* to measure. If the consumption is half of those 3 µs, then the 17 µs were 91% overhead and just 9% consumption. – Stefan Pochmann Apr 03 '22 at 10:16
  • (continuing) One reason I *always* strive to minimize the overhead is that I consider it good practice, maybe more important in other cases, and to set a good example for others. For example, abarnert's answer reused your code and made it look like PyPy's top iteration speed is almost 10 times as fast as CPython's, which I find more seriously misleading. It would be more like factor 2, if CPython were given a `tuple`. – Stefan Pochmann Apr 03 '22 at 10:16
  • (continuing) Some [timings in PyPy](https://tio.run/##hVDLTsMwELz7K@ZS2UahpFTlUClfghAK6QYsxQ/sjUS@PthJKRWqxJ7smdnRzoSJP7zb34cpTPPcR2/BxpJhGBt8ZEQK1LIQiXgMaCClFGeq88NAHRvvkoiZiq17J7Wr61oLzv@8MJC6QrUo26LzJ0qZfxbII3sf8QrjYJiiivqI0KYkq1ss/2GvTtie6HMkdTapYNuvgVxT63@0fEs7mMQ/VjdAXsAXcTluDbnXx0VZ4BKyMEvYFS5TerHGqbVWVdgKS7cV3GjfKDZrVXhAeVwWQzSOldwcto89xgSJDRTjDjt6ygmKkRa/Qj3P3w) (About 1.5 us for the for-loop, 26 us for `deque(iterator)` and 1.8 us for `list(iterator)`. Both on `range` and on `tuple`). – Stefan Pochmann Apr 03 '22 at 10:16
  • (continuing) Same [timings in CPython](https://tio.run/##hVBda8MwDHzXr9BLsT1Cl65sjEJ@yRgjS5XNEH9MVqD99ZmddF0ZgenJvjsduotn@Qx@/xx5mnoODsU6soLWxcCCTJFaAUgkY8QGlVJwobowDNSJDT4BZ4pb/0F6V9e1Acn/vDCQvkENlG3owpFS5l8A86g@ML6h9WiFWLM5YGxTUtUaK3/YmxO2R/oaSV9MKnTtaSDf1OYfraxpB5vkx2oFlBl8hetxS8i9OczKApeQhZnDLnCZ0ouzXi@16sJWOHdboR/dO3GzVIX3WB7XxcjWi1abx@1Dj2NChRvUgne4o6ecoBgZ@BWaafoG) (On `tuple` about 8.7 us for the for-loop, 3.0 us for `deque(iterator)` and 3.9 us for `list(iterator)`. On `range` about 23 us for the for-loop, 17 us for `deque(iterator)` and 24 us for `list(iterator)`). – Stefan Pochmann Apr 03 '22 at 10:22

If you only care about CPython, deque is the fastest way, as demonstrated in user2357112's answer.1 The same has been demonstrated on 2.7 and 3.2, on 32- and 64-bit builds, on Windows and Linux, and so on.

But that relies on an optimization in CPython's C implementation of deque. Other implementations may have no such optimization, which means they end up calling an append for each element.

In PyPy in particular, there is no such optimization in the source,2 and the JIT cannot optimize the no-op append away. (And it's hard to see how it could avoid at least a guard test each time through the loop.) Of course that overhead is still small compared to the cost of looping in Python, right? But looping in Python is blazing fast in PyPy, almost as fast as a C loop in CPython, so this actually makes a huge difference.

Comparing the times (using tests identical to those in user2357112's answer):3

          for      deque
CPython   19.7us   12.7us
PyPy       1.37us  23.3us

There are no 3.x versions of the other major interpreters, and I don't have IPython for any of them, but a quick test with Jython shows similar effects.

So, the fastest portable implementation is something like:

import sys

if sys.implementation.name == 'cpython':
    import collections
    def consume(it):
        return collections.deque(it, maxlen=0)
else:
    def consume(it):
        for _ in it:
            pass

This of course gives me 12.7us in CPython, and 1.41us in PyPy.
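If you also want the recipe's n-step behavior, the same split works; the islice branch behaves identically on every implementation (a sketch along the lines of the recipe above):

import sys
from itertools import islice

if sys.implementation.name == 'cpython':
    import collections
    def consume(it, n=None):
        if n is None:
            collections.deque(it, maxlen=0)  # C-speed fast path
        else:
            next(islice(it, n, n), None)  # advance exactly n steps
else:
    def consume(it, n=None):
        if n is None:
            for _ in it:  # a plain loop is what PyPy's JIT handles best
                pass
        else:
            next(islice(it, n, n), None)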


1. Of course you could write a custom C extension, but it's only going to be faster by a tiny constant term—you can avoid the constructor call and the test before jumping to the fast path, but once you get into that loop, you have to do exactly what it's doing.

2. Tracing through PyPy source is always fun… but I think it ends up in the W_Deque class, which is part of the builtin _collections module.

3. CPython 3.6.4; PyPy 5.10.1/3.5.3; both from the respective standard 64-bit macOS installers.

abarnert
  • This is great, thanks! I hadn't really considered the difference in implementation between different interpreters. I am currently stuck in a Jupyter notebook. I am going to have to look into this aspect more... – Engineero Jun 20 '18 at 13:53
  • @Engineero If you didn't already know this: Jupyter notebooks can run PyPy kernels. (Last time I tried, there were some glitches using PyPy kernels with a CPython notebook, while the other way around worked fine but had no way to install the local Qt stuff. But if you don't need to run both kernels together, you can just have two parallel Jupyter installations.) – abarnert Jun 20 '18 at 23:48

The more_itertools package provides a consume() function, but on my PC (Python 3.5) it is on par with the deque solution. You might check whether it brings an advantage on your specific interpreter.

>>> timeit.timeit(lambda: collections.deque(range(1, 10000000), maxlen=0), number=10)
1.0916123000000084
>>> timeit.timeit(lambda: more_itertools.consume(range(1, 10000000)), number=10)
1.092838400000005
B.R.
  • It is _on par with the deque solution_ because... [it is the deque solution](https://more-itertools.readthedocs.io/en/stable/_modules/more_itertools/recipes.html#consume). – Ignatius Reilly Mar 16 '23 at 21:50
  • Very good point! I did not check the underlying code; in that case there is no real performance reason to use `more_itertools.consume()`. – B.R. Mar 17 '23 at 19:16