
In the context of a discussion in the comments of this question, it was mentioned that while concatenating a sequence of strings is simply `''.join([str1, str2, ...])`, concatenating a sequence of lists would be something like `list(itertools.chain(lst1, lst2, ...))`, although you can also use a nested list comprehension like `[x for y in [lst1, lst2, ...] for x in y]`. What surprised me is that the `itertools.chain` approach is consistently faster than the list comprehension:

import random
import itertools

random.seed(100)
lsts = [[1] * random.randint(100, 1000) for i in range(1000)]

%timeit [x for y in lsts for x in y]
# 39.3 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(itertools.chain.from_iterable(lsts))
# 30.6 ms ± 866 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(x for y in lsts for x in y)  # Proposed in comments
# 62.5 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Loop-based methods proposed in the comments
%%timeit
a = []
for lst in lsts: a += lst
# 26.4 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a = []
for lst in lsts: a.extend(lst)
# 26.7 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The difference is not orders of magnitude, but it is not negligible either. I was wondering how that could be the case, since list comprehensions are frequently among the fastest ways to solve a given problem. At first I thought that maybe the `itertools.chain` object would have a `len` that the `list` constructor could use to preallocate the necessary memory, but that is not the case (calling `len` on an `itertools.chain` object raises a `TypeError`). Is some custom `itertools.chain`-to-`list` conversion taking place somehow, or is `itertools.chain` taking advantage of some other mechanism?
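As a quick sanity check (not part of the original discussion), you can verify that a `chain` object exposes neither `__len__` nor a `__length_hint__`, so the `list` constructor gets no size estimate from it:

import itertools
import operator

c = itertools.chain([1, 2], [3, 4])

# chain objects do not implement __len__ ...
try:
    len(c)
except TypeError as exc:
    print(exc)  # e.g. "object of type 'itertools.chain' has no len()"

# ... and they provide no __length_hint__ either; length_hint simply
# falls back to the default we pass in.
print(operator.length_hint(c, 999))  # 999

# A plain list, by contrast, reports its exact size.
print(operator.length_hint([1, 2, 3], 999))  # 3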

Tested in Python 3.6.3 on Windows 10 x64, if that is relevant.

EDIT:

It seems the fastest method after all is extending an empty list with each list (either `a += lst` or `a.extend(lst)`), as proposed by @zwer, probably because it works on "chunks" of data instead of on a per-element basis.
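A small helper along the lines @zwer suggests in the comments might look like this (the name `list_join` comes from that comment; the implementation is just an illustrative sketch):

def list_join(lists):
    """Concatenate a sequence of lists by extending one result list.

    Working one whole list at a time lets list.extend grow the result
    in larger steps instead of appending element by element.
    """
    result = []
    for lst in lists:
        result.extend(lst)
    return result

print(list_join([[1, 2], [3], [4, 5, 6]]))  # [1, 2, 3, 4, 5, 6]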

jdehesa
  • What about `list(x for y in lsts for x in y)`? – Peter Wood Apr 03 '18 at 13:31
  • @PeterWood Added it for comparison, seems even worse, which is also kind of surprising. – jdehesa Apr 03 '18 at 13:33
  • 2
    My guess is that the list comprehension has to do some expression evaluation in python space whereas the itertools variant isn't doing any expression evaluation (it's just yielding things entirely in C-space). – mgilson Apr 03 '18 at 13:33
  • @mgilson You might be right, although if I change the number of lists the gap seems to scale. Ah wait, you mean expression evaluation on each iteration? – jdehesa Apr 03 '18 at 13:35
  • 1
    @jdehesa The [byte-code generated by LCs](https://ideone.com/zYcKZw) compared to functions that directly delegate to C makes a huge difference. – Ashwini Chaudhary Apr 03 '18 at 13:42
  • 1
    It would be useful if someone with C experience could dig into the difference. Could be something to do with memory allocation, but just a guess. – jpp Apr 03 '18 at 13:50
  • @AshwiniChaudhary I see, so that's it then, it's just faster because it's in C? I really feel there should be a `[].join([lst1, lst2, ...])` syntax for lists (and maybe tuples), it's a rather common operation that most likely many developers are not implementing as well as it could be because it is kind of cumbersome (and a dedicated `join` method could probably do it even faster). – jdehesa Apr 03 '18 at 13:55
  • @jdehesa There were several propositions when this was being decided, [Guido chose `str.join` at the end](https://mail.python.org/pipermail/python-dev/1999-June/095436.html). – Ashwini Chaudhary Apr 03 '18 at 14:01
  • If C-level detail isn't already provided elsewhere, I'll offer a bounty for a good answer. I have a vested interest: all too often I get shot down for using `chain.from_iterable` "..when you can do it without a library with nested so and so!" – jpp Apr 03 '18 at 14:02
  • @jpp Whoever is shooting you down for using itertools should check with a doctor :-P. With that said, it's basically a case of more code running at Python level vs C level. Chain's `next` method isn't doing anything special, it's just doing everything at C level: https://hg.python.org/cpython/file/c79fd57f86b6/Modules/itertoolsmodule.c#l1846 – Ashwini Chaudhary Apr 03 '18 at 14:05
  • 1
    Oh, by the way, related and still not answered: https://stackoverflow.com/questions/45190729/differences-between-generator-comprehension-expressions – Right leg Apr 03 '18 at 14:30
  • @jdehesa - While there isn't `[].join(...)` simple straight concatenation will still work. I'd even wager that it's faster than `itertools.chain()`, too. – zwer Apr 03 '18 at 14:58
  • @zwer What do you mean with "simple straight concatenation"? Just `+`/`sum`? – jdehesa Apr 03 '18 at 14:59
  • @jdehesa - Yep. Initiate an empty list `target = []` and then just flat extend it: `for x in lsts: target += x`. I can test it, but based on previous experiences I'm almost willing to bet it would outperform `itertools.chain()`. A bit more verbose, tho, but you can always write a `list_join()` function and stash it away to use instead of the _missing_ `[].join(...)`. – zwer Apr 03 '18 at 15:01
  • @zwer Just checked, looks about as fast as the list comprehension. As was suggested, I suppose it's just a matter of running more code in C than Python with `itertools`. – jdehesa Apr 03 '18 at 15:14
  • @jdehesa - That's weird, I just tested it and concatenation is consistently ~20% faster than `itertools.chain`, both on Python 2.7.14 and 3.5.1. – zwer Apr 03 '18 at 16:40
  • @zwer Huh, you're right, I run it again now and it looks like what you say, maybe something in the background affected the results earlier or something... I suppose this is the fastest because it works "per list", instead of "per element" (that's what I meant when I was saying a specialized `.join` method could be faster). Thanks for the suggestion. – jdehesa Apr 03 '18 at 16:49

1 Answer


Here is the implementation of `itertools.chain.from_iterable` (in CPython's `Modules/itertoolsmodule.c`). It's not hard to read even if you don't know C, and you can tell everything is happening at the C level (before the result is used to build a list in your code).
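For reference, the itertools documentation gives a roughly equivalent pure-Python version of `chain.from_iterable`; the real implementation does the same thing entirely in C, which is where the speed comes from:

# Roughly equivalent pure-Python version, adapted from the itertools docs.
# The C implementation runs this loop without executing any Python bytecode
# per element.
def from_iterable(iterables):
    for it in iterables:
        for element in it:
            yield element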

The bytecode for list comprehensions is like so:

import dis

def f(lsts):
    return [x for y in lsts for x in y]

# disassemble the nested <listcomp> code object
dis.dis(f.__code__.co_consts[1])
  2           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                18 (to 24)
              6 STORE_FAST               1 (y)
              8 LOAD_FAST                1 (y)
             10 GET_ITER
        >>   12 FOR_ITER                 8 (to 22)
             14 STORE_FAST               2 (x)
             16 LOAD_FAST                2 (x)
             18 LIST_APPEND              3
             20 JUMP_ABSOLUTE           12
        >>   22 JUMP_ABSOLUTE            4
        >>   24 RETURN_VALUE

These are all the Python interpreter operations involved in evaluating the list comprehension. Having all the work happen at the C level (in `chain`) rather than having the interpreter step through each bytecode instruction (in the comprehension) is what gives you that performance boost.
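For contrast, disassembling a function that simply delegates to `chain.from_iterable` (a quick check you can run yourself) shows no `FOR_ITER` or `LIST_APPEND` in the Python bytecode; the per-element loop lives entirely inside C:

import dis
import itertools

def h(lsts):
    return list(itertools.chain.from_iterable(lsts))

# The disassembly contains only LOAD/CALL/RETURN instructions -- no
# per-element bytecode, because both chain.from_iterable and list()
# iterate at the C level.
dis.dis(h)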

Still, that boost is so small I wouldn't worry about it. This is Python: readability over speed.


Edit:

For a list-wrapped generator comprehension:

def g(lsts):
    return list(x for y in lsts for x in y)

# the comprehension
dis.dis(g.__code__.co_consts[1])
  2           0 LOAD_FAST                0 (.0)
        >>    2 FOR_ITER                20 (to 24)
              4 STORE_FAST               1 (y)
              6 LOAD_FAST                1 (y)
              8 GET_ITER
        >>   10 FOR_ITER                10 (to 22)
             12 STORE_FAST               2 (x)
             14 LOAD_FAST                2 (x)
             16 YIELD_VALUE
             18 POP_TOP
             20 JUMP_ABSOLUTE           10
        >>   22 JUMP_ABSOLUTE            2
        >>   24 LOAD_CONST               0 (None)
             26 RETURN_VALUE

So the interpreter has a similar number of steps to go through when running the generator expression that `list` unpacks, but, as you would expect, the Python-level overhead of having `list` consume a generator one `YIELD_VALUE` at a time (as opposed to the `LIST_APPEND` instruction appending directly at the C level) is what slows it down.

dis.dis(f)
  2           0 LOAD_CONST               1 (<code object <listcomp> at 0x000000000FB58B70, file "<ipython-input-33-1d46ced34d66>", line 2>)
              2 LOAD_CONST               2 ('f.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_FAST                0 (lsts)
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE

dis.dis(g)
  2           0 LOAD_GLOBAL              0 (list)
              2 LOAD_CONST               1 (<code object <genexpr> at 0x000000000FF6F420, file "<ipython-input-40-0334a7cdeb8f>", line 2>)
              4 LOAD_CONST               2 ('g.<locals>.<genexpr>')
              6 MAKE_FUNCTION            0
              8 LOAD_FAST                0 (lsts)
             10 GET_ITER
             12 CALL_FUNCTION            1
             14 CALL_FUNCTION            1
             16 RETURN_VALUE
FHTMitchell
  • That only explains why the first approach is way slower than the two other, but not why the `itertools.chain` approach is faster than the cast of a generator comprehension into a list, isn't it? – Right leg Apr 03 '18 at 14:19
  • 1
    @Rightleg The list-wrapped generator comprehension was slower than both because a list comprehension is a c-optimised list-wrapped generator comprehension. (so many adjectives :/) – FHTMitchell Apr 03 '18 at 14:24
  • 1
    @FHTMitchell "list comprehension is a c-optimised list wrapped generator comprehension" is completely wrong. LCs don't create generator expression, it creates a list( check BUILD_LIST in your answer) and then populates it using special `LIST_APPEND` bytecode that calls `PyList_Append`. – Ashwini Chaudhary Apr 03 '18 at 14:35
  • @AshwiniChaudhary fair enough. I did know that, I suppose I was trying to simplify things. I'll edit my answer. – FHTMitchell Apr 03 '18 at 14:42
  • I've updated the answer, since a simple for loop, as a user suggested, is actually faster than all the other options... – jdehesa Apr 03 '18 at 16:50
  • I think that's quite obvious. You're doing 1000 operations instead of 500 times that amount. You are also preallocating some of the memory as opposed to dynamically increasing it. – FHTMitchell Apr 03 '18 at 17:46