107

Joining a list:

>>> ''.join([ str(_) for _ in xrange(10) ])
'0123456789'

join must take an iterable.

Here, join's argument is [ str(_) for _ in xrange(10) ], which is a list comprehension.

Look at this:

>>> ''.join( str(_) for _ in xrange(10) )
'0123456789'

Now, join's argument is just str(_) for _ in xrange(10), no [], but the result is the same.

Why? Does str(_) for _ in xrange(10) also produce a list or an iterable?

Peter Mortensen
Alcott
  • 1
    I would imagine that `join` is most likely written in C and therefore runs much faster than a list comprehension... Testing time! – Joel Cornett Jan 30 '12 at 07:31
  • Apparently, I read your question completely wrong. It seems to be returning a generator for me... – Joel Cornett Jan 30 '12 at 07:38
  • 21
    Just a note: `_` has no special meaning; it's a regular variable name. It's often used as a throw-away name, but that's not the case here (you are using the variable). I would avoid using it in code (in this way, at least). – rplnt Jan 30 '12 at 09:22

7 Answers

164

The other respondents were correct in answering that you had discovered a generator expression (which has a notation similar to list comprehensions but without the surrounding square brackets).

In general, genexps (as they are affectionately known) are more memory efficient and faster than list comprehensions.

HOWEVER, in the case of ''.join(), a list comprehension is both faster and more memory efficient. The reason is that join needs to make two passes over the data, so it actually needs a real list. If you give it one, it can start its work immediately. If you give it a genexp instead, it cannot start work until it builds up a new list in memory by running the genexp to exhaustion:

~ $ python -m timeit '"".join(str(n) for n in xrange(1000))'
1000 loops, best of 3: 335 usec per loop
~ $ python -m timeit '"".join([str(n) for n in xrange(1000)])'
1000 loops, best of 3: 288 usec per loop

The same result holds when comparing itertools.imap versus map:

~ $ python -m timeit -s'from itertools import imap' '"".join(imap(str, xrange(1000)))'
1000 loops, best of 3: 220 usec per loop
~ $ python -m timeit '"".join(map(str, xrange(1000)))'
1000 loops, best of 3: 212 usec per loop
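
To make the two-pass point concrete, here is a rough pure-Python sketch of the idea (a sketch only: join_sketch is a made-up name, and the real str.join is C code that writes characters into a single allocated buffer rather than a Python list):

def join_sketch(sep, iterable):
    # Materialize the argument unless it is already a real sequence,
    # because the algorithm below has to walk it twice.
    seq = iterable if isinstance(iterable, (list, tuple)) else list(iterable)
    if not seq:
        return ''
    # Pass 1: measure, so the result buffer can be sized once up front.
    total = len(sep) * (len(seq) - 1) + sum(len(s) for s in seq)
    # Pass 2: copy each piece into the preallocated buffer
    # (a list of characters stands in for the real character buffer here).
    buf = [''] * total
    pos = 0
    for i, s in enumerate(seq):
        if i:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(s)] = s
        pos += len(s)
    return ''.join(buf)  # only the final conversion; the sizing and copying happen above

>>> join_sketch('-', (str(n) for n in xrange(5)))
'0-1-2-3-4'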
Raymond Hettinger
  • Strange, I get the opposite result for `"".join(l)`. I use timeit setup to set `l` to either `[str(n) for n in xrange(1000)]` or `(x for x in [str(n) for n in xrange(1000)])`. Results: 21.7 usec and 1.55 usec, respectively. `iter([str(n) for n in xrange(1000)])` is faster still: 0.87 usec. – Lauritz V. Thaulow Jan 30 '12 at 09:07
  • 6
    @lazyr Your second timing is doing too much work. Don't wrap a genexp around a listcomp -- just use a genexp directly. No wonder you got odd timings. – Raymond Hettinger Jan 30 '12 at 09:51
  • 2
    In an effort to time _only_ `join`, I put everything except `"".join(l)` in the setup statement. And therein lies the problem -- the genexp was exhausted on the first loop, so the next 999999 times it joined an empty genexp. No wonder it was fast. – Lauritz V. Thaulow Jan 30 '12 at 10:06
  • 13
    Could you explain why `''.join()` needs 2 passes over the iterator to build a string? – ovgolovin Jan 30 '12 at 10:42
  • 34
    @ovgolovin I'd guess the first pass is to sum the lengths of the strings so as to be able to allocate the correct amount of memory for the concatenated string, while the second pass is to copy the individual strings into the allocated space. – Lauritz V. Thaulow Jan 30 '12 at 13:42
  • 25
    @lazyr That guess is correct. That is exactly what str.join does :-) – Raymond Hettinger Jan 30 '12 at 15:49
  • 2
    Python 3.3 appears to be significantly slower at these (like 40-50% slower) – Andy Hayden Jan 23 '14 at 05:31
  • What do you mean with list comprehension being more memory efficient here? The genexp only has O(1) space overhead, doesn't it? Hardly seems worth mentioning. Or am I overlooking something else? – Stefan Pochmann Apr 14 '17 at 09:40
  • 1
    @StefanPochmann FWIW, the memory differences aren't the same in Python 3, since a Python 3 list comp creates a new scope whereas in Python 2 a list comp runs in the surrounding scope. – PM 2Ring May 10 '18 at 01:33
  • @LauritzV.Thaulow but building a temporary list has the same problem, with the drawback that the list buffer will be much larger than the character buffer since it's holding 8 byte pointers vs. 4 byte code points. Something doesn't add up. – Mark Ransom Mar 29 '22 at 12:21
  • 2
    @MarkRansom: It doesn't have the same problem; `list`s have amortized `O(1)` appends (each time it runs out of space it allocates a multiple of the current capacity and doesn't have to reallocate again for awhile). `str` being immutable means it never intentionally overallocates for a single concatenation even when the in-place concatenation optimization is used, so *every* concatenation triggers a `realloc`, and some of them actually have to move the `str` (when the heap doesn't have space to expand the allocation in-place). Repeated concatenation is `O(n**2)`, `list`ifying is `O(n)`. – ShadowRanger Jun 10 '22 at 03:41
  • @PM2Ring But that's an argument for the *opposite*, no? That the list comprehension would be *less* memory efficient, because the last value assigned to `n` is kept alive. I don't see a reason why the list comprehension is or ever was *more* memory efficient (except for the generator's O(1) overhead and possibly different overallocation strategies of list comp and the list creation used by `join`). I'm curious what Raymond meant. – Kelly Bundy Feb 12 '23 at 12:58
  • 1
    @RaymondHettinger I keep referencing this answer whenever someone recommends using a generator in `join` for (memory) efficiency. But since the answer is old, I'd like to also link to the code to see the current implementation. Is it [Objects/stringlib/join.h](https://github.com/python/cpython/blob/main/Objects/stringlib/join.h)? I'm always uncertain with strings, because their implementation is spread over so many files and there are different kinds of strings. – Kelly Bundy Feb 12 '23 at 13:03
  • @KellyBundy Fair point. I was referring to the call stack Frame object that gets created for the new scope, although of course that object can be recycled when the list comp finishes executing. – PM 2Ring Feb 12 '23 at 15:45
  • @PM2Ring Ah, ok. But that's just an insignificant constant overhead, right? I [tracemalloc'ed](https://tio.run/##lY5BCoMwEEX3OcXgKgGx2lIohV7BC4hItKlNMZMwzsbTp9EuXLXQvxgY@PPehIWfHk@XQDFaFzwxMOnBOD1NfgA9AzshHp6gBotAGkcjqzIlh2OaVQ7nslRXISCFXTGzJpZqW7OseHmLcmaSqGCl4E6p1ael4bYejoa7TX3vnHGeFqmaqt2xPiTqL0vzTdN@ev2/nkAWWdZ5@vAAvYrxDQ) in Python 3 with sizes from 10000 to 20000 now, and both genexp and listcomp took almost the same memory, alternating between which one took less (the ratio danced around 1.0). – Kelly Bundy Feb 12 '23 at 16:11
  • @KellyBundy Right. But in Python 2, a list comp didn't have that Frame object, since it ran in the surrounding scope. – PM 2Ring Feb 12 '23 at 16:13
76

>>> ''.join( str(_) for _ in xrange(10) )

This is called a generator expression, and is explained in PEP 289.

The main difference between generator expressions and list comprehensions is that the former don't create the list in memory.
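
For example (Python 2 here, to match the xrange in the question), the bare expression evaluates to a generator object rather than a list:

>>> type(str(_) for _ in xrange(10))
<type 'generator'>
>>> type([str(_) for _ in xrange(10)])
<type 'list'>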

Note that there's a third way to write the expression:

''.join(map(str, xrange(10)))
NPE
  • 1
    As I know it, a generator can be produced through a tuple-like expression, like `( str(_) for _ in xrange(10) )`. But I was confused: why can the `()` be omitted in `join`? That is, the code should be `''.join( (str(_) for _ in xrange(10)) )`, right? – Alcott Jan 30 '12 at 07:44
  • 3
    @Alcott My understanding of tuples is that they are actually defined by the comma-separated list of expressions and not the parentheses; the parentheses are only there to visually group the values in an assignment or to actually group the values if the tuple were going into some other comma-separated list, like a function call. This is often demonstrated by running code like `tup = 1, 2, 3; print(tup)`. With that in mind, using `for` as a part of an expression creates the generator, and the parentheses are just there to distinguish it from an incorrectly written loop. – Eric Ed Lohmar Oct 16 '19 at 20:08
6

Your second example uses a generator expression rather than a list comprehension. The difference is that with the list comprehension, a list is completely built and passed to .join(). With the generator expression, items are generated one by one and consumed by .join(). The latter uses less memory and is generally faster.

As it happens, the list constructor will happily consume any iterable, including a generator expression. So:

[str(n) for n in xrange(10)]

is just "syntactic sugar" for:

list(str(n) for n in xrange(10))

In other words, a list comprehension is just a generator expression that is turned into a list.
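
A quick check (Python 2, matching the question) that the two spellings produce the same list, whatever the implementation differences discussed in the comments below:

>>> [str(n) for n in xrange(5)] == list(str(n) for n in xrange(5))
True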

kindall
  • 2
    Are you sure they're equivalent under the hood? Timeit says: `[str(x) for x in xrange(1000)]`: 262 usec, `list(str(x) for x in xrange(1000))`: 304 usec. – Lauritz V. Thaulow Jan 30 '12 at 08:53
  • 2
    @lazyr You are right. List comprehension is faster. And this is the reason why list comprehensions leak in Python 2.x. This is what GVR wrote: "This was an artifact of the original implementation of list comprehensions; it was one of Python's "dirty little secrets" for years. It started out as an intentional compromise to make list comprehensions blindingly fast, and while it was not a common pitfall for beginners, it definitely stung people occasionally." http://python-history.blogspot.com/2010/06/from-list-comprehensions-to-generator.html – ovgolovin Jan 30 '12 at 09:14
  • 4
    @ovgolovin The reason the listcomp is faster is because *join* has to create a list before it can start work. The "leak" you refer to isn't a speed issue -- it just means that the loop induction variable is exposed outside the listcomp. – Raymond Hettinger Jan 30 '12 at 09:56
  • 1
    @RaymondHettinger Then what do these words mean: "It started out as an **intentional compromise to make list comprehensions blindingly fast**"? As I understood it, there is a connection between their leakage and the speed issues. GVR also wrote: "For generator expressions we could not do this. Generator expressions are implemented using generators, whose execution requires a separate execution frame. Thus, **generator expressions** (especially if they iterate over a short sequence) **were less efficient than list comprehensions**." – ovgolovin Jan 30 '12 at 10:08
  • 1
    Still, it's important to add these words of GVR: "And before you start worrying about list comprehensions becoming slow in Python 3: thanks to the enormous implementation effort that went into Python 3 to speed things up in general, **both list comprehensions and generator expressions in Python 3 are actually faster than they were in Python 2! (And there is no longer a speed difference between the two.)**" – ovgolovin Jan 30 '12 at 10:11
  • 5
    @ovgolovin You've made an incorrect leap from a listcomp implementation detail to why str.join performs the way it does. One of the first lines in the str.join code is ``seq = PySequence_Fast(orig, "");`` and that is the sole reason iterators run more slowly than lists or tuples when calling str.join(). You're welcome to start-up a chat if you want to discuss it further (I'm the author of PEP 289, the creator of the LIST_APPEND opcode, and the one who optimized the list() constructor, so I do have some familiarity with the issue). – Raymond Hettinger Jan 30 '12 at 16:18
  • BTW, I did not mean to imply that they are equivalent under the hood -- surely that would be an implementation-specific detail -- just that the list comprehension is basically a special case of a generator expression with a little sweetening thrown in. – kindall Jan 30 '12 at 17:09
  • @RaymondHettinger I made my second comment regarding the mention in the answer that a list comprehension is merely "syntactic sugar" for `list(*generator expression*)`. And all I knew about it was from the article I mentioned in the comments. So my knowledge of the issue is rather superficial, consisting almost entirely of the maxim that list comprehensions in Python 2.x are faster than generator expressions. And I did not know that `''.join()` requires the length of the object before it can construct the string, and that in this particular case that is the main reason for the slowdown. – ovgolovin Jan 30 '12 at 17:17
  • @RaymondHettinger And here I'm writing comments more out of curiosity. I hope it's interesting not only for me! At least it's not that obvious that `''.join()` is better off with list comprehensions than generator expression. – ovgolovin Jan 30 '12 at 17:24
  • @RaymondHettinger One more question, if you don't mind. As I understand it, `PySequence_Fast(...)` is nearly the same as `tuple(...)` or `list(...)` (depending on the object supplied as the parameter). So if we call `''.join(*generator expression*)`, the gen. expression will be converted to a tuple. So it's nearly the same as `''.join(tuple(*generator expression*))`. But `''.join(*list comprehension*)` is nearly the same as `''.join(list(*generator expression*))`. The only difference between the last two is that the former is implemented a bit differently, so it works faster and has the leakage. – ovgolovin Jan 30 '12 at 17:37
  • @RaymondHettinger So `''.join(*generator expression*)` turns out to perform nearly the same operations that are performed by `''.join(*list comprehension*)` with the only difference that `*list comprehension*` creates a list before it's sent as a parameter to `join`, and `*generator expression*` is used to create a tuple only after it's received by `join`, which calls `PySequence_Fast` on it. Am I correct in my thinking? – ovgolovin Jan 30 '12 at 17:41
  • 1
    @ovgolovin *PySequence_Fast(x)* is roughly equivalent to ``x if isinstance(x, (list, tuple)) else list(x)``. In other words, if the argument is already a list or tuple, it doesn't need to do any work at all. This is different from ``list(x)`` which always makes a new list. – Raymond Hettinger Jan 31 '12 at 00:22
6

As mentioned, it's a generator expression.

From the documentation:

The parentheses can be omitted on calls with only one argument. See section Calls for the detail.
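
As an illustration of that rule (sorted is used here purely as an arbitrary call that takes a second argument, and the exact SyntaxError wording varies between Python versions), the parentheses become mandatory as soon as the generator expression is not the sole argument:

>>> sorted(str(_) for _ in xrange(10), reverse=True)
SyntaxError: Generator expression must be parenthesized if not sole argument
>>> sorted((str(_) for _ in xrange(10)), reverse=True)
['9', '8', '7', '6', '5', '4', '3', '2', '1', '0']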

Georgy
monkut
4

If it's in parens, but not brackets, it's technically a generator expression. Generator expressions were first introduced in Python 2.4.

http://wiki.python.org/moin/Generators

The part after the join, ( str(_) for _ in xrange(10) ), is, by itself, a generator expression. You could do something like:

mylist = (str(_) for _ in xrange(10))
''.join(mylist)

and it means exactly the same thing that you wrote in the second case above.

Generators have some very interesting properties, not the least of which is that they don't end up allocating an entire list when you don't need one. Instead, a function like join "pumps" the items out of the generator expression one at a time, doing its work on the tiny intermediate parts.

In your particular examples, list and generator probably don't perform terribly differently, but in general, I prefer using generator expressions (and even generator functions) whenever I can, mostly because it's extremely rare for a generator to be slower than a full list materialization.
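
One rough way to see the allocation difference described above is to compare container sizes with sys.getsizeof (the exact byte counts are CPython- and version-specific, and getsizeof counts only the list's pointer array, not the strings themselves):

import sys

big_list = [str(n) for n in xrange(1000000)]   # a million strings built up front
lazy_gen = (str(n) for n in xrange(1000000))   # nothing built yet

print sys.getsizeof(big_list)   # grows with the number of elements (megabytes of pointers)
print sys.getsizeof(lazy_gen)   # small and constant, no matter how long the range is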

sblom
2

That's a generator, rather than a list comprehension. Generators are also iterables, but rather than creating the entire list first then passing it to join, it passes each value in the xrange one by one, which can be much more efficient.
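
For instance (Python 2, as in the question), you can consume such a generator one value at a time yourself:

>>> g = (str(_) for _ in xrange(3))
>>> next(g)
'0'
>>> next(g)
'1'
>>> list(g)    # whatever has not been consumed yet
['2']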

Daniel Roseman
0

The argument to your second join call is a generator expression. It does produce an iterable.

Michael J. Barber