
There are several Stack Overflow questions about ways to append together all the elements of many lists. What I don't see is a definitive answer about the relative performance of these different methods. Also, sometimes a slight performance gain comes at the cost of some readability, so it would be useful to know several different approaches.

The task: Given a list that contains an arbitrary number of lists, each containing an arbitrary number of elements, form a single list with all the elements: first all the elements from the first list, then all the elements of the second list, and so on. This is a shallow append: if a sub-list has an element that is itself a list, DON'T extract those individual elements. E.g.

append_all([ [11, 12, 13], [21, 22], [31, [320, 329], 33, 34] ])

=>

[11, 12, 13, 21, 22, 31, [320, 329], 33, 34]
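
For reference, here is a minimal sketch of one such append_all (the name just mirrors the example call above, and the extend-loop is only one of the candidates; the question is how the alternatives compare):

def append_all(LL):
    # Shallow append: concatenate the sub-lists of LL into one new list,
    # leaving any nested lists inside the sub-lists untouched.
    result = []
    for L in LL:
        result.extend(L)
    return result

print(append_all([ [11, 12, 13], [21, 22], [31, [320, 329], 33, 34] ]))
# -> [11, 12, 13, 21, 22, 31, [320, 329], 33, 34]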
ToolmakerSteve
  • related: [Flattening a shallow list in Python](http://stackoverflow.com/q/406121/4279) especially [this answer](http://stackoverflow.com/a/408281/4279) – jfs Dec 20 '13 at 07:19
  • cool: I didn't scroll down far enough to see that answer, when searching for previous questions. [Link to graph from your comment](http://s403.photobucket.com/user/uber_ulrich/media/p10000.png.html). – ToolmakerSteve Dec 20 '13 at 18:36
  • On which version of Python? Some of the answers are different for CPython vs. PyPy, CPython 2.7 vs. 3.4, 32-bit vs. 64-bit CPython, etc. And of course the answers are _also_ different for different number and length of lists. Also, do you actually need a list, or just a flattened iterable of any kind? Because an `itertools.chain` will obviously be faster than a list if your list is so big it causes page misses… – abarnert Dec 20 '13 at 18:51
  • @abarnert: All very true - read the details and caveats in my answer. – ToolmakerSteve Dec 20 '13 at 19:08

1 Answer


Here are timings for several ways to append together multiple lists.

They are shown from fastest to slowest.

Python 2.7 (CPython, running inside Autodesk Maya 2014), Windows 7 64-bit, Intel Core i7-3770K @ 3.5 GHz.

import timeit
def p_timeit_min(msg, expr_str, number, setup):
    times = timeit.repeat(expr_str, number=number, setup=setup, repeat=3)
    print( '{0:18} => {1:6.3f}'.format( msg, min(times) ))

n = 1000
timeit.repeat('1+1', number=10000)   # "dummy" -- I'm in an environment where the first timeit call is erratic in performance.
setup_0 = ';  import operator;  import itertools;  L1 = list(range(n));  LL = [[10 * x + v for v in L1] for x in range(n)]'
print
p_timeit_min('map+extend 100', 'all = [];   map(all.extend, LL)', number=n, setup='n = 100'+setup_0)
p_timeit_min('for+extend 100', """
all = []
for L in LL:
    all.extend(L)
""", number=n, setup='n = 100'+setup_0)
p_timeit_min('extend 100', 'all = [];   [all.extend(L) for L in LL]', number=n, setup='n = 100'+setup_0)
# reduce with a [] initializer, to avoid the need to wrap each L in list().
p_timeit_min('reduce+iadd 100 []', 'all = reduce(operator.iadd, LL, [])', number=n, setup='n = 100'+setup_0)
p_timeit_min('filter extend 100', 'all = [];   filter( lambda x: False, iter(all.extend(L) for L in LL) )', number=n, setup='n = 100'+setup_0)
# WARNING: if the list() wrapper around list_ is removed, then (since this version doesn't supply a [] to accumulate into) iadd will MODIFY the first element of LL, which may not be desired.
p_timeit_min('reduce+iadd 100 list()', 'all = reduce(operator.iadd, (list(list_) for list_ in LL))', number=n, setup='n = 100'+setup_0)
p_timeit_min('chain 100', 'all = list(itertools.chain(*LL))', number=n, setup='n = 100'+setup_0)
p_timeit_min('comprehension 100', 'all = [x for list_ in LL for x in list_]', number=n, setup='n = 100'+setup_0)
p_timeit_min('nested for append 100',
"""
all = []
for L in LL:
    for x in L:
        all.append(x)
""", number=n, setup='n = 100'+setup_0)
p_timeit_min('sum 100', 'all = sum(LL, [])', number=n, setup='n = 100'+setup_0)
print
p_timeit_min('map+extend 200', 'all = [];   map(all.extend, LL)', number=n, setup='n = 200'+setup_0)
p_timeit_min('for+extend 200', """
all = []
for L in LL:
    all.extend(L)
""", number=n, setup='n = 200'+setup_0)
p_timeit_min('extend 200', 'all = [];   [all.extend(L) for L in LL]', number=n, setup='n = 200'+setup_0)
p_timeit_min('reduce+iadd 200 []', 'all = reduce(operator.iadd, LL, [])', number=n, setup='n = 200'+setup_0)
p_timeit_min('filter extend 200', 'all = [];   filter( lambda x: False, iter(all.extend(L) for L in LL) )', number=n, setup='n = 200'+setup_0)
p_timeit_min('reduce+iadd 200 list()', 'all = reduce(operator.iadd, (list(list_) for list_ in LL))', number=n, setup='n = 200'+setup_0)
p_timeit_min('chain 200', 'all = list(itertools.chain(*LL))', number=n, setup='n = 200'+setup_0)
p_timeit_min('comprehension 200', 'all = [x for list_ in LL for x in list_]', number=n, setup='n = 200'+setup_0)
p_timeit_min('nested for append 200', """
all = []
for L in LL:
    for x in L:
        all.append(x)
""", number=n, setup='n = 200'+setup_0)
p_timeit_min('sum 200', 'all = sum(LL, [])', number=n, setup='n = 200'+setup_0)
print

Output:

map+extend 100           =>  0.062
for+extend 100           =>  0.064  ** within margin of error of first place, but slower on average
extend 100               =>  0.066
reduce+iadd 100 []       =>  0.063  ** see "200" case for reason this isn't placed higher in list.
filter extend 100        =>  0.078
reduce+iadd 100 list()   =>  0.105  ** ignore this - use the better "reduce" above.
chain 100                =>  0.127
comprehension 100        =>  0.250
nested for append 100    =>  0.672
sum 100                  =>  1.424

These times should be ~4x longer for O(n) algorithms: the "200" case is 200 x 200, so it has 4x as many total elements as the "100" case (40,000 vs 10,000).

OBSERVE: The top 5 variations do significantly better than that - only about 3x longer for 4x as many elements. This is because each sub-list is longer while the number of sub-lists only increased by 2x, and for those variations the Python-level overhead is per sub-list, not per element:

map+extend 200           =>  0.187
for+extend 200           =>  0.190
extend 200               =>  0.194
reduce+iadd 200 []       =>  0.204
filter extend 200        =>  0.217
reduce+iadd 200 list()   =>  0.311  ** ignore this - use the better "reduce" above.
chain 200                =>  0.426
comprehension 200        =>  0.931
nested for append 200    =>  2.676
sum 200                  => 13.432

ANALYSIS: The top four solutions are not substantially different, if each list has many elements.

A nested list comprehension takes about 4x as long as the best solution. On the other hand, it is still O(n) in the total number of elements. In many situations, that 4x applies to a time value small enough that it doesn't matter.

CONCLUSION: If this is a performance-critical factor for your situation, use list.extend with any means of looping, or reduce(operator.iadd, LL, []). Runner-up choices: itertools.chain at ~2x the cost, or [x for L in LL for x in L] (a nested list comprehension) at ~4x the cost.
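
To make those recommendations concrete, here is a minimal sketch of the two winning forms wrapped as functions (the names flatten_extend and flatten_iadd are mine, not something used in the timings above):

import operator
from functools import reduce   # reduce is a builtin in Python 2; this import works on both 2.6+ and 3.x

def flatten_extend(LL):
    # loop + list.extend: append each sub-list onto one result list
    result = []
    for L in LL:
        result.extend(L)
    return result

def flatten_iadd(LL):
    # reduce + operator.iadd, with a fresh [] accumulator so the
    # sub-lists inside LL are NOT modified
    return reduce(operator.iadd, LL, [])

Either one returns a new list and leaves LL and its sub-lists unchanged.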

CAVEAT: This is one test, on one machine, on one implementation (CPython 2.7).

It assumes you have lists and need the result as a list; if you have (or need) sequences/generators/iterators instead, that might change the balance, perhaps in favor of itertools.chain.
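
For example, if you only need to iterate over the flattened elements (and LL itself might be any iterable, not just a list), a lazy sketch could use itertools.chain.from_iterable, which also avoids unpacking LL into call arguments the way chain(*LL) does (iter_all is a hypothetical name, not used elsewhere in this answer):

import itertools

def iter_all(LL):
    # Lazily yield every element of every sub-list, without building a list.
    return itertools.chain.from_iterable(LL)

for x in iter_all([ [11, 12, 13], [21, 22] ]):
    print(x)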

TODO: A test with many SHORT lists.

ToolmakerSteve
  • No timing for the `for` loop extend? – user2357112 Dec 20 '13 at 03:44
  • lol. Good point. Hmm, how do I pass a multi-line string to timeit? I guess I try with '\n' character in the string. – ToolmakerSteve Dec 20 '13 at 03:45
  • @ToolmakerSteve `;` perhaps – thefourtheye Dec 20 '13 at 03:47
  • A triple-quoted literal can be an effective option for long commands. – user2357112 Dec 20 '13 at 03:47
  • `reduce` takes an optional 3rd parameter as a starting point for the reduction, which could speed up the `reduce` version. – user2357112 Dec 20 '13 at 03:49
  • `"""this part can span multiple lines"""` - just type regular Python code in the string. – user2357112 Dec 20 '13 at 03:53
  • ok, got both forms of "for": a single for + extend, and nested for + append. – ToolmakerSteve Dec 20 '13 at 04:09
  • Testing 100x100 and 200x200 may not be enough variation to verify algorithmic complexity, or to catch issues like different cache and page-miss thresholds. – abarnert Dec 20 '13 at 18:55
  • Also, for all of the mechanisms that use a function call or method call inside a loop (like `all.extend`), try copying the function or bound method to a local variable to avoid all the lookup costs. (In some cases that requires turning a lambda into a def.) – abarnert Dec 20 '13 at 18:56 (a sketch of this local-binding trick appears after these comments)
  • For one example: If you wrap the "filter extend" code in a function in setup, then make the test code `all[:]=[]; f(all)`, doing `allx = all.extend` and using that inside the genexpr instead of using `all.extend` directly drops the time from 0.336 to 0.307. (Of course if you're doing it at global scope, that's a different story. But realistically function scope is a more likely place for people to be optimizing.) – abarnert Dec 20 '13 at 19:15
  • 1
    Anyway, [here](http://pastebin.com/5fjCZFrw) are some additional numbers on your test as-is, to bring Python 3 and PyPy into the mix. (You might want to verify the py3 port.) – abarnert Dec 20 '13 at 19:18
  • @abarnert: awesome. What is best code wrapper for testing iterator/generator performance? Something with deque/pop? I need to know the exact details to use; I'm hazy on fairest test, when don't need lists... – ToolmakerSteve Dec 20 '13 at 19:41
  • @abarnert. Also, I'm not sure on quickest way to REGENERATE an iteration-of-iterations for each timing loop. – ToolmakerSteve Dec 20 '13 at 19:47
  • @ToolmakerSteve: Somewhere on SO, there's a question with extensive tests with something like CPython 2.7.3, 3.2.2, and 3.3.0, 32- and 64-bit, plus PyPy 1.9.0/2.7.2 64-bit, plus some version of IronPython with both native .NET and mono, and determined that `collections.deque(it, maxlen=0)` was always the fastest way to consume an iterator. Of course that might not always match real-life use cases, where you're often consuming it a bit at a time in between slower operations, rather than blasting through as fast as possible while everything is still in cache… – abarnert Dec 20 '13 at 22:09
  • @ToolmakerSteve: Meanwhile, as long as you regenerate the outermost iterable (in your case, `all[:] = []`) in the setup instead of the timed code, it doesn't really matter how you do it. – abarnert Dec 20 '13 at 22:10
  • @ToolmakerSteve: Anyway, from my results, it looks like using any of the `extend` methods, using a local copy of the bound method when appropriate, is within a few percent of optimal on every implementation either of us tested, which is probably good enough, just as you initially suggested. The remaining questions are (a) different list sizes (easy but maybe tedious to test and graph), and (b) different use cases for iterables (much harder to test). – abarnert Dec 20 '13 at 22:12
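
For reference, a minimal sketch of the two micro-optimizations discussed in the comments above, with helper names that are mine: caching the bound extend method in a local variable, and exhausting an iterator with collections.deque(..., maxlen=0):

import collections
import itertools

def flatten_local_bind(LL):
    # Copy the bound method into a local name, so the attribute lookup
    # on "result.extend" is not repeated on every iteration.
    result = []
    extend = result.extend
    for L in LL:
        extend(L)
    return result

def consume(iterator):
    # Reportedly the fastest way to exhaust an iterator when the results
    # are not needed: a deque with maxlen=0 stores nothing.
    collections.deque(iterator, maxlen=0)

# e.g. timing pure iteration over a lazily chained result:
consume(itertools.chain.from_iterable([ [1, 2], [3] ]))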