
This answer and its comments provide some insight into the inner workings of CPython's `str.join()`:

  1. If the argument is not already a list or a tuple, a new list is created with the same contents.
  2. The argument is iterated over once, to sum the lengths of the strings it holds.
  3. Memory is allocated for a new string.
  4. Finally, the argument is iterated over a second time, and the strings are copied into the memory for the new string.
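The four steps above can be sketched in pure Python. This is only an illustration, not the actual implementation (which lives in C, in `Objects/stringlib/join.h`, and does the final copy with raw `memcpy`); the `bytearray` stands in for the allocated buffer, and the sketch assumes ASCII strings to keep the byte/character accounting trivial:

```python
def join_sketch(sep, iterable):
    # Step 1: if the argument is not already a list or tuple,
    # build a list with the same contents (CPython does this
    # at the C level via PySequence_Fast).
    if not isinstance(iterable, (list, tuple)):
        iterable = list(iterable)
    # Step 2: first pass -- sum the string lengths (plus separators)
    # so the result buffer can be sized exactly.
    total = sum(len(s) for s in iterable)
    if iterable:
        total += len(sep) * (len(iterable) - 1)
    # Step 3: allocate the result buffer in one go.
    buf = bytearray(total)
    # Step 4: second pass -- copy each piece into place.
    pos = 0
    for i, s in enumerate(iterable):
        if i:
            buf[pos:pos + len(sep)] = sep.encode()
            pos += len(sep)
        b = s.encode()
        buf[pos:pos + len(b)] = b
        pos += len(b)
    return buf.decode()
```

Because the lengths are summed up front, the buffer is allocated exactly once, which is the whole point of the two-pass design.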

This seems questionable to me. For starters, why reject all sequence types but two? Wouldn't just iterating over any sequence twice instead of copying it be much faster? And why make a list, particularly if you can't know the length of the iterable you're making it from? You don't need random access, just repeated iteration, and using a list means you might have to reallocate and copy several times during its generation. Wouldn't it make more sense to use a linked list or a deque?

Can anyone provide some insights into these design decisions?

Vladislav Rastrusny
Blacklight Shining
  • For strings of a reasonable size, there is far less data that has to be reallocated when growing the list than when resizing the string being built. – user2864740 Apr 12 '15 at 17:18
  • @user2864740 Compare that to using a linked list, which would always require exactly zero reallocations for the entire run. – Blacklight Shining Apr 12 '15 at 17:19
  • Not true at all. The cells in a linked list must be allocated. The difference is merely whether they are allocated in a single array (as per `list`) or individually. In both cases *the items within are not cloned*. That is, the `list` allocation scheme effectively acts like a SLAB. – user2864740 Apr 12 '15 at 17:20
  • @user2864740 O(n) allocations to build the linked list. No _re_-allocations. – Blacklight Shining Apr 12 '15 at 17:21
  • The difference between the two allocation schemes has been discussed over and over. There are very few cases when a linked list is even 'better' in this aspect. There are *significantly fewer overall allocations* for an array-backed structure, but it requires a copy after the allocations, and slack is required to reduce the overall allocations (and copies of the cells). Copies are generally *cheaper* than allocations. – user2864740 Apr 12 '15 at 17:22
  • @BlacklightShining note that a Python list is an array of references to objects, not of the objects themselves – jonrsharpe Apr 12 '15 at 17:23
  • @jonrsharpe Right. And to increase the size of that array (once you run out of slack), you have to allocate a new, larger, block of memory, copy all of the elements over, and free the old block. – Blacklight Shining Apr 12 '15 at 17:25
  • @BlacklightShining It doesn't matter in the end. The array-backed list effectively acts as its own slab. There are only ~O(lg n) reallocations required (with a growth factor of two). This ends up being much better than O(n) new node allocations in many cases, especially considering it avoids per-node overheads. *Even if* there is a slight difference here (my money is still on an array-list implementation), it is inconsequential to the overall cost of the join. – user2864740 Apr 12 '15 at 17:25
  • Related: [`itertools.tee`](https://docs.python.org/2/library/itertools.html#itertools.tee). Note the comment: *"In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use `list()` instead of `tee()`."* – jonrsharpe Apr 12 '15 at 17:27
  • Copying a list of references from one memory block to another at the C level is trivial if you know it's just a move - no reference counts need to be updated. It's just a `memcpy`. You only end up doing it log(n) times. – Mark Ransom Apr 13 '15 at 01:33

1 Answer


For starters, why reject all sequence types but two? Wouldn't just iterating over any sequence twice instead of copying it be much faster?

The argument of `join` need not be a sequence. It can be any iterable, and some iterables cannot be iterated over more than once. It could, for instance, be a generator expression, which would be exhausted after iterating over it once.
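A quick demonstration of that exhaustion, which is why a two-pass algorithm cannot run directly on an arbitrary iterable:

```python
# A generator expression can only be consumed once, so any algorithm
# that needs two passes must first materialize it (e.g. into a list).
gen = (str(n) for n in range(3))

first_pass = list(gen)   # consumes every item
second_pass = list(gen)  # the generator is now exhausted

print(first_pass)   # ['0', '1', '2']
print(second_pass)  # []
```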

As to your second question, I don't know specifically, although I'd guess that using lists and tuples internally simplifies the implementation at the C level. I think the broader answer to your question is that no one was really intending to make every possible optimization to `str.join`. I would guess that the vast majority of use cases are calling it on a list or tuple anyway.
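To make the point concrete: from the caller's perspective the fast path is invisible. All of the calls below produce the same result; a `list` or `tuple` argument is used directly, while everything else is first copied into a list internally (the dict example assumes modern Python 3, where dicts preserve insertion order):

```python
sep = "-"
data = ["a", "b", "c"]

# list and tuple take the fast path (no internal copy)...
print(sep.join(data))          # a-b-c
print(sep.join(tuple(data)))   # a-b-c

# ...while generators and other iterables are materialized first.
print(sep.join(s for s in data))            # a-b-c
print(sep.join({"a": 1, "b": 2, "c": 3}))   # a-b-c (iterates over the keys)
```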

BrenBarn
  • Yes, but why reject all _sequence_ types? You could just try to call `__getitem__()`, and fall back to creating a `list` out of it. – Blacklight Shining Apr 12 '15 at 17:14
  • @BlacklightShining because `__getitem__` is an attribute of (some) Python objects - it doesn't exist on the objects being manipulated in C behind the scenes (in CPython - you could consider looking at e.g. the PyPy implementation of `str.join`). – jonrsharpe Apr 12 '15 at 17:20
  • @BlacklightShining: `list` and `tuple` are basically the only types you can be sure are sequences (other than `str` and `unicode`, which aren't useful for `join`, and maybe a few esoteric ones like `xrange`). There can be mapping types that have `__getitem__` but aren't sequence types (like `dict`). You could of course use `__getitem__`-based iteration twice, but that is riskier since you don't know what user-defined types might be doing with that. – BrenBarn Apr 12 '15 at 17:21
  • @BrenBarn Why not let that happen with mapping types? If the keys happen to be all `str`ings, then you get a string of the keys, separated by whatever the separating string is. If not, it fails anyway. – Blacklight Shining Apr 12 '15 at 17:27
  • @BlacklightShining that's not what `__getitem__` does on e.g. a `dict`; you would get the values for integer keys, but there might be missing numbers and non-integer keys. – jonrsharpe Apr 12 '15 at 17:30
  • @jonrsharpe You're right; I was confusing `__iter__()` and `__getitem__()`. – Blacklight Shining Apr 12 '15 at 17:32
  • Okay. What about detecting non-container iterables by checking for `__next__()`, building `list`s from those, and just using `__iter__()` for everything else? I doubt no one's thought about optimizing `str.join()`; it seems like a really commonly-used feature. – Blacklight Shining Apr 12 '15 at 17:36
  • @BlacklightShining again, that's all at the Python level - `str.join` is [*implemented in C*](https://hg.python.org/cpython/file/08adaaf08697/Objects/stringlib/join.h). – jonrsharpe Apr 12 '15 at 17:38
  • @jonrsharpe There…isn't a way to look up object attributes at the C level? – Blacklight Shining Apr 12 '15 at 17:41
  • @BlacklightShining those attributes *don't exist* at the C level. You should probably look at [`PySequence_Fast`](https://hg.python.org/cpython/file/08adaaf08697/Objects/abstract.c#l1768), which is where the functionality you refer to is implemented (i.e. lists and tuples are passed through, everything else gets turned into a list). – jonrsharpe Apr 12 '15 at 17:50
  • @BlacklightShining: All these comments boil down to what I said at the end of my answer: the reason none of these optimizations are made is because no one cares. It's not that people haven't thought about optimizing `str.join`; like I said in my answer, people don't care about optimizing `str.join` for unusual use-cases involving custom iterable types, because most of the time you're using it with a list or tuple anyway. – BrenBarn Apr 12 '15 at 18:00