
The join() method accepts an iterable as its parameter. However, I was wondering why, having:

text = 'asdfqwer'

This:

''.join([c for c in text])

Is significantly faster than:

''.join(c for c in text)

The same occurs with long strings (e.g. text * 10000000).

Watching the memory footprint of both executions with long strings, I think they both create one and only one list of chars in memory, and then join them into a string. So I am guessing perhaps the difference is only between how join() creates this list out of the generator and how the Python interpreter does the same thing when it sees [c for c in text]. But, again, I am just guessing, so I would like somebody to confirm/deny my guesses.
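The observation above can be reproduced with a minimal timeit sketch (the absolute numbers are machine-dependent; on CPython the list comprehension version is usually the faster of the two):

```python
# Benchmark sketch of the question's observation. Timings vary by machine;
# only the relative ordering is of interest.
import timeit

text = 'asdfqwer' * 1000

t_list = timeit.timeit("''.join([c for c in text])",
                       globals={'text': text}, number=100)
t_gen = timeit.timeit("''.join(c for c in text)",
                      globals={'text': text}, number=100)

print(f'list comprehension:   {t_list:.4f}s')
print(f'generator expression: {t_gen:.4f}s')
```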

jonrsharpe
Peque

1 Answer


The join method reads its input twice: once to determine how much memory to allocate for the resulting string object, and again to perform the actual join. Passing a list is faster than passing a generator object, which join needs to copy into a list so that it can iterate over it twice.
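A small sketch of the two-pass requirement: because str.join must know the total length before it copies any characters, a one-shot generator is first materialized into a sequence, and the generator body itself runs only once per item:

```python
# Record each time the generator yields, to show that join consumes it
# exactly once (into an internal sequence) before doing its two passes.
yielded = []

def chars():
    for c in 'abc':
        yielded.append(c)  # record each yield
        yield c

result = ''.join(chars())
print(result)   # abc
print(yielded)  # ['a', 'b', 'c'] -- each character yielded exactly once
```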

A list comprehension is not simply a generator expression wrapped in a call to list(), so constructing the list externally is faster than having join build it from a generator object. Generator objects are optimized for memory efficiency, not speed.
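This can be checked directly, independent of join: building the same list via a comprehension versus via list() over a generator expression (again, the numbers are illustrative and machine-dependent):

```python
# Compare two ways of building the identical list of characters. The
# comprehension runs in a specialized loop; list() over a generator
# expression pays the generator-resume cost for every element.
import timeit

text = 'asdfqwer' * 1000

t_comp = timeit.timeit('[c for c in text]',
                       globals={'text': text}, number=100)
t_gen = timeit.timeit('list(c for c in text)',
                      globals={'text': text}, number=100)

print(f'[c for c in text]:     {t_comp:.4f}s')
print(f'list(c for c in text): {t_gen:.4f}s')
```

Both expressions produce the same list; only the construction path differs.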

Of course, a string is already an iterable object, so you could just write ''.join(text). (Though, again, this is not as fast as creating the list explicitly from the string.)
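For completeness, all three spellings produce the same string; passing the string directly is the simplest, even though (as the comments below note) join special-cases only lists and tuples, so an explicit list can still be marginally faster:

```python
# Three equivalent ways to rebuild the string; they differ only in speed.
text = 'asdfqwer'

a = ''.join(text)
b = ''.join(list(text))
c = ''.join([ch for ch in text])

print(a == b == c)  # True
print(a)            # asdfqwer
```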

chepner
    I just tested this with a generator that prints. `join` did not execute the generator twice. It's a fairly far-fetched claim. What makes you think it executes the generator twice? – Daniel Darabos Sep 08 '15 at 15:58
    @DanielDarabos *"that it needs to make a copy of"* - by consuming it into a list! – jonrsharpe Sep 08 '15 at 16:00
    Reproducing my question comment here since you mentioned it in your answer: Interestingly `timeit` on my system shows that a list is faster than the direct string iteration here as well. – Two-Bit Alchemist Sep 08 '15 at 16:02
    @Two-BitAlchemist that's because [`PySequence_Fast`](https://hg.python.org/cpython/file/tip/Objects/abstract.c#l1768) only special-cases lists and tuples - everything else (including strings) gets processed one extra time. – jonrsharpe Sep 08 '15 at 16:06
    Generator objects are slow. They are optimized for memory usage, not speed of iteration. Creating a list from an existing generator takes more time than building the list using a list comprehension. (Despite the similarity in syntax, a list comprehension is not just a generator object wrapped in a list.) – chepner Sep 08 '15 at 16:31
  • @chepner I am wondering why the generator produced by a generator expression is slower. Why is it not optimized in the form of some PyGenerator_Fast? – Andrey Sep 08 '15 at 17:12
    @Andrey That's not a very good test because you're using a short-circuiting function, and we're talking about one that has to reprocess a generator (and therefore must make a copy). Additionally, in your test, generators were _still_ very slightly slower (though not significantly). – Two-Bit Alchemist Sep 08 '15 at 17:34
  • @Two-BitAlchemist in this example it doesn't short-circuit; `all` will not stop while the values are true (1). – Andrey Sep 08 '15 at 17:49
    @Andrey Yes, but that's not the point. `all` must use the generator at most once. `join` must use the generator _at least twice_. – Two-Bit Alchemist Sep 08 '15 at 17:51
  • @Two-BitAlchemist my comment was not really about join, it was response to comment "Generator objects are slow." – Andrey Sep 08 '15 at 17:52
  • @Peque That's the only way it's possible to use a generator twice. – Two-Bit Alchemist Sep 08 '15 at 18:19