8

Consider the problem of extracting the alphabetic characters from a huge string.

One way to do this is:

''.join([c for c in hugestring if c.isalpha()])

The mechanism is clear: The list comprehension generates a list of characters. The join method knows how many characters it needs to join by accessing the length of the list.

Another way to do it is:

''.join(c for c in hugestring if c.isalpha())

Here the generator expression results in a generator. The join method does not know how many characters it is going to join, because the generator has no length. So this way of joining should be slower than the list comprehension method.
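For illustration, a small self-contained sketch (using a short literal string rather than the questioner's hugestring) showing that a generator indeed has no length, yet join still accepts it:

```python
# A generator has no length...
gen = (c for c in "a1b2c3" if c.isalpha())
try:
    len(gen)
except TypeError:
    print("generators have no len()")

# ...but str.join works on it anyway
result = ''.join(c for c in "a1b2c3" if c.isalpha())
print(result)  # abc
```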

But testing in Python shows that it is not slower. Why is this so? Can anyone explain how join works on a generator?

To be clear:

sum(j for j in range(100))

doesn't need to have any knowledge of 100 because it can keep track of the cumulative sum. It can access the next element using the next method on the generator and then add it to the cumulative sum. However, since strings are immutable, joining strings cumulatively would create a new string in each iteration. So this would take a lot of time.
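The contrast can be sketched as follows (`naive_join` is a hypothetical helper written only to illustrate the cumulative-concatenation cost, not how join actually works):

```python
# sum can stream: it only ever holds the running total, never the whole sequence
total = sum(j for j in range(100))
print(total)  # 4950

# hypothetical cumulative joining: each step builds a brand-new string,
# so joining n pieces this way costs O(n^2) copying overall
def naive_join(iterable):
    result = ''
    for piece in iterable:
        result = result + piece  # allocates a fresh string every iteration
    return result

print(naive_join(c for c in "a1b2c3" if c.isalpha()))  # abc
```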

galan

3 Answers

14

When you call str.join(gen) where gen is a generator, Python does the equivalent of list(gen) before going on to examine the length of the resulting sequence.

Specifically, if you look at the code implementing str.join in CPython, you'll see this call:

    fseq = PySequence_Fast(seq, "can only join an iterable");

The call to PySequence_Fast converts the seq argument into a list if it wasn't a list or tuple already.

So, the two versions of your call are handled almost identically. In the list comprehension version, you're building the list yourself and passing it into join. In the generator expression version, the generator object you pass in gets turned into a list right at the start of join, and the rest of the code operates the same for both versions.
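A rough Python-level sketch of this behaviour (an illustration only, not CPython's actual C code; `join_sketch` is a made-up name):

```python
def join_sketch(sep, iterable):
    # mirror PySequence_Fast: materialize anything that isn't a list or tuple
    if not isinstance(iterable, (list, tuple)):
        iterable = list(iterable)
    if not iterable:
        return ''
    # the length is now known, so CPython can precompute the output size;
    # here we just interleave the separator and concatenate once at the end
    pieces = []
    for item in iterable:
        pieces.append(item)
        pieces.append(sep)
    pieces.pop()  # drop the trailing separator
    return ''.join(pieces)

print(join_sketch('-', (c for c in "abc")))  # a-b-c
```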

Blckknght
  • So the difference in speed OP notices should be purely circumstantial, right? – Ma0 Feb 27 '17 at 13:36
  • @Ev.Kounis: The questioner said the two versions were similar in speed ("**not** slower"), which makes sense if they were measuring both the time taken by `join` and the time taken by the list comprehension together. If you measured only the `join`, the generator expression version would be slower since it has to dump the whole generator into a list before doing any string joining. That will take about as much time as building the list comprehension does in the other version. – Blckknght Feb 27 '17 at 20:04
  • One might be deceived that using a generator should be more memory efficient for huge strings... :( – pabouk - Ukraine stay strong Jan 05 '22 at 15:27
1

join() does not need to be implemented as a sequential appending of elements of the sequence to a longer and longer accumulated string (which would indeed be very slow for long sequences); it just needs to produce the same result. So join() is probably just appending characters to some internal memory buffer, and creating a string from it at the end. The list comprehension construct, on the other hand, needs to first construct the list (by iterating over hugestring), and only then let join() begin its work.

Also, I doubt that join() looks at the list's length, since it can't know that each element is a single character (in most cases, it won't be) - it probably just obtains a generator from the list.
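The "internal memory buffer" idea from this answer can be sketched with io.StringIO (a stand-in for illustration; as the comments below note, CPython actually uses its own internal machinery):

```python
import io

def buffered_join(iterable):
    # append pieces to a growable buffer instead of rebuilding a string each step
    buf = io.StringIO()
    for piece in iterable:
        buf.write(piece)
    return buf.getvalue()  # one final string construction at the end

print(buffered_join(c for c in "a1b2c3" if c.isalpha()))  # abc
```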

Aasmund Eldhuset
  • 3
    The reference interpreter C-layer code sports a complete (but private) `_PyUnicodeWriter` API for this purpose (and other similar "build string piecemeal" cases). Compare to Java's `StringBuilder` class. – ShadowRanger Jan 16 '16 at 02:09
  • 2
    That said, it looks like @Blckknight is correct; it's converting the input to a `list` internally if it's not already a `list` or `tuple`. It also looks like it then does a precompute pass to calculate the length of the final value so as to preallocate exactly as much as it needs, rather than use `_PyUnicodeWriter` at all. – ShadowRanger Jan 16 '16 at 02:20
1

At least on my machine, the list comprehension is faster for the case I tested, likely due to ''.join being able to optimize the memory allocation. It likely just depends on the specific example you're testing (e.g., if the condition you're testing occurs less frequently, the price CPython pays for not knowing length ahead of time may be smaller):

In [18]: s = ''.join(np.random.choice(list(string.printable), 1000000))

In [19]: %timeit ''.join(c for c in s if c.isalpha())
10 loops, best of 3: 69.1 ms per loop

In [20]: %timeit ''.join([c for c in s if c.isalpha()])
10 loops, best of 3: 61.8 ms per loop
Randy
  • 1
    This is an artifact of list comprehensions being hyperoptimized (they build the `list` directly, where the generator expression just `yield`s values that must be consumed using the general purpose iterator protocol), not anything specific to how `''.join` works. Run the same test, but replace `''.join` with `list` (you can omit it entirely in the second case, where it's redundant). The `list` constructor around a generator expression is significantly slower, and for an input this large, it clearly has nothing to do with lookup or function call costs associated with `list`. – ShadowRanger Jan 20 '16 at 02:29
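ShadowRanger's suggested experiment can be run as a quick timing sketch (CPython assumed; absolute numbers vary by machine, so none are shown here):

```python
import timeit

setup = (
    "import random, string\n"
    "s = ''.join(random.choices(string.printable, k=100000))"
)
# the list comprehension builds the list directly...
t_listcomp = timeit.timeit("[c for c in s if c.isalpha()]", setup=setup, number=20)
# ...while list() must drain the generator through the iterator protocol
t_genexpr = timeit.timeit("list(c for c in s if c.isalpha())", setup=setup, number=20)
print(f"list comprehension: {t_listcomp:.3f}s, list(genexpr): {t_genexpr:.3f}s")
```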