13

I've seen several examples from different languages that unambiguously show that joining the elements of a list (array) is many times faster than concatenating strings one by one. Why?

What algorithm underlies each operation, and why is one faster than the other?

Here is a Python example of what I mean:

# This is slow
x = 'a'
x += 'b'
...
x += 'z'

# This is fast
x = ['a', 'b', ... 'z']
x = ''.join(x)
philipxy
Ilian Iliev

7 Answers

15

The reason is that strings in Python (and many other languages) are immutable objects - that is, once created, they can't be changed. Instead, concatenating a string actually makes a new string which consists of the contents of the two smaller strings being concatenated, and then replaces the old string with the new one.

Since creating a string takes a certain amount of time (need to allocate memory, copy the contents of the string to that memory, et cetera), making many strings takes longer than making a single string. Doing N concatenations requires creating N new strings in the process. join(), on the other hand, only has to create a single string (the final result) and thus works much faster.
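As a rough illustration (the function names are mine, and exact timings vary by machine and Python version), a quick `timeit` comparison of the two approaches:

```python
import timeit

letters = [chr(c) for c in range(ord('a'), ord('z') + 1)]

def concat():
    # Repeated += : each step may build a brand-new string
    s = ''
    for ch in letters:
        s += ch
    return s

def joined():
    # join() computes the final length once and copies each char once
    return ''.join(letters)

# Both produce the same result
assert concat() == joined() == 'abcdefghijklmnopqrstuvwxyz'

t_concat = timeit.timeit(concat, number=10_000)
t_join = timeit.timeit(joined, number=10_000)
print(f'+= loop: {t_concat:.3f}s, join: {t_join:.3f}s')
```

With only 26 one-character strings the gap is small; the effect grows with the number and size of the pieces being concatenated.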

Amber
15

The code in a join function knows upfront all the strings it’s being asked to concatenate and how large those strings are, and hence it can calculate the final string length before beginning the operation.

Hence it needs only allocate memory for the final string once and then it can place each source string (and delimiter) in the correct place in memory.

On the other hand, a single += operation on a string can only allocate enough memory for the concatenation of the two strings involved. Each subsequent += must do the same, allocating memory that the next += will discard. Each time, the ever-growing string is copied from one place in memory to another.
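A hand-rolled sketch of the strategy described above (a pure-Python, ASCII-only illustration of my own, not CPython's actual C implementation of join):

```python
def join_sketch(sep, parts):
    # Pass 1: the total length is known before any copying happens
    total = sum(len(p) for p in parts) + len(sep) * max(len(parts) - 1, 0)
    buf = bytearray(total)            # a single allocation
    pos = 0
    sep_b = sep.encode()
    for i, p in enumerate(parts):
        if i:                         # place the delimiter between pieces
            buf[pos:pos + len(sep_b)] = sep_b
            pos += len(sep_b)
        pb = p.encode()
        buf[pos:pos + len(pb)] = pb   # place each source string once
        pos += len(pb)
    return buf.decode()

print(join_sketch('-', ['a', 'b', 'c']))  # a-b-c
```

The key property is the two-pass structure: measure first, then write each piece into its final position exactly once.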

Peter Mortensen
AnthonyWJones
3

This is because a larger and larger chunk of memory has to be allocated for the string concatenation:

x = 'a'  # String of size 1 allocated
x += 'b' # String of size 2 allocated, x copied, and 'b' added. Old x discarded
x += 'c' # String of size 3 allocated, x copied, and 'c' added. Old x discarded
x += 'd' # String of size 4 allocated, x copied, and 'd' added. Old x discarded
x += 'e' # String of size 5 allocated, x copied, and 'e' added. Old x discarded

So what happens is you perform large allocations and copies, but then turn around and throw them away. Very wasteful.

x = ['a', 'b', ..., 'z'] # 26 small allocations
x = ''.join(x) # A single, large allocation
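You can see the immutability behind this directly in the interpreter: a string cannot be modified in place, and += simply rebinds the name to a new object while any other reference keeps the old one:

```python
x = 'abc'
try:
    x[0] = 'z'           # strings don't support item assignment
except TypeError as e:
    print('immutable:', e)

# += rebinds the name; the original object is unchanged
y = x
x += 'd'
print(x, y)  # abcd abc  (y still refers to the old string)
```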
Ignacio Vazquez-Abrams
  • You'd earn my upvote if you mentioned something about immutable objects. Not all languages require throwing away existing strings when concatenating. – Amber Feb 24 '10 at 09:54
3

See Python string join performance and one specific answer that describes it very well:

The advice is about concatenating a lot of strings.

To compute s = s1 + s2 + ... + sn,

  1. Using +: a new string s1+s2 is created, then a new string s1+s2+s3, ..., etc., so a lot of memory allocations and copy operations are involved. In fact, s1 is copied n-1 times, s2 is copied n-2 times, ..., etc.

  2. using "".join([s1,s2,...,sn]). The concatenation is done in one pass, and each char in the strings is copied only once.
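Counting the copies makes the difference above concrete. A small sketch of mine (a rough model that ignores CPython's internal optimizations) tallying how many characters get written under each strategy:

```python
def copies_concat(lengths):
    # Building s1 + s2 + ... step by step: each += writes out the whole
    # intermediate result plus the new piece into a fresh string.
    total = 0
    size = 0
    for n in lengths:
        size += n
        total += size      # the new string of length `size` is written out
    return total

def copies_join(lengths):
    # join() writes every character exactly once
    return sum(lengths)

lengths = [1] * 26   # 26 one-character strings, as in the question
print(copies_concat(lengths))  # 351 = 1 + 2 + ... + 26
print(copies_join(lengths))    # 26
```

The += strategy does quadratic work in the number of pieces, while join stays linear.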

Peter Mortensen
Leo
2

The other responses have basically covered it, but if you want even more detail, Joel Spolsky has an article describing "Schlemiel the painter's algorithm", which is extremely relevant here. It nicely makes the case for why understanding this sort of low-level implementation detail still matters even when you're working in a high-level language like Python.

thraxil
0

Well, this is heavily language-dependent, but in general the idea is that one big operation is faster than many small ones.

In your second example, the join knows all the elements that it has to join and thus can just allocate the necessary resources and put the characters in.

The concatenation in your first example has to reallocate resources at every single step (in the worst case).

Peter Mortensen
Björn Pollex
0

I don't know the internals of join, but in the first version you create a new string every time you use the += operator. Since strings are immutable, new memory is allocated and a copy is made each time.

Now, join (which is a string method) can do a single allocation, since it can calculate the size beforehand.

kgiannakakis