(This may be lots of detail the OP is already aware of, but going through the issue in full could help others who end up on this question)
The problem with `mystring += suffix` is that strings are immutable, so this is actually equivalent to `mystring = mystring + suffix`. So the implementation has to create a new string object, copy all of the characters from `mystring` over to it, and then copy all the characters from `suffix` after that. Then the name `mystring` is rebound to refer to the new string; the original string object that `mystring` referred to is untouched.
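You can see the rebinding directly by keeping a second reference to the original object (a minimal sketch; the variable names are my own):

    mystring = 'hello'
    original = mystring          # keep a second reference to the old object
    mystring += ' world'
    print(mystring is original)  # False: mystring now names a new object
    print(original)              # 'hello' -- the original is untouched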
On its own, that's not actually a problem. Any method of concatenating these two strings has to do this, including `''.join([mystring, suffix])`, which is actually worse because it has to first construct a list object and then iterate over it; and while there's no actual data transfer in splicing the empty string between `mystring` and `suffix`, that'll take at least one instruction to sort out.
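If you want to check that claim for the two-string case, a quick `timeit` comparison looks something like this (a rough sketch; the absolute numbers depend on your machine and Python version):

    import timeit

    setup = "a = 'x' * 100; b = 'y' * 100"
    # Plain + avoids building and iterating a list, so it should come out
    # slightly ahead of ''.join for exactly two strings.
    print(timeit.timeit('a + b', setup=setup))
    print(timeit.timeit("''.join([a, b])", setup=setup))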
Where `+=` becomes a problem is when you do it repeatedly. Something like this:
    mystring = ''
    for c in 'abcdefg' * 1000000:
        mystring += c
Remember that `mystring += c` is equivalent to `mystring = mystring + c`. So on the first iteration of the loop it evaluates `'' + 'a'`, copying 1 character in total. Next it does `'a' + 'b'`, copying 2 characters in total. Then `'ab' + 'c'` for 3 characters, then `'abc' + 'd'` for 4, and I think you can see where this is going. Every subsequent `+=` repeats all of the work of the previous one, and then copies the new string as well; building an n-character string this way copies 1 + 2 + ... + n = n(n+1)/2 characters in total, so the cost grows quadratically. This gets extremely wasteful.
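To put a number on it (a quick check of the arithmetic, not a benchmark):

    # Total characters copied when building an n-character string with
    # repeated +=, ignoring the CPython optimization discussed below:
    n = 1000
    print(sum(range(1, n + 1)))  # 500500
    print(n * (n + 1) // 2)      # 500500 -- quadratic in n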
`''.join(...)` is better because there you wait until you know all of the strings before copying any of them, and then copy each one directly into its correct place in the final string object. Contrary to what some of the comments and answers have said, this remains the case even if you have to modify your loop to append the strings to a list of strings, and then `join` them after the loop. Lists are not immutable, so appending to a list modifies it in place, and an append only needs to store a single reference rather than copying all the characters of a string. Doing thousands of appends to a list is much faster than doing thousands of string `+=` operations.
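In loop form, that pattern looks like this (a sketch of the idiom, not code from the question):

    # Accumulate pieces in a list (cheap: each append stores one
    # reference), then copy every character exactly once with a
    # single join at the end.
    pieces = []
    for c in 'abcdefg' * 1000000:
        pieces.append(c)
    mystring = ''.join(pieces)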
Repeated string `+=` is theoretically a problem even without loops, if you just write your source code something like:
    s = 'foo'
    s += 'bar'
    s += 'baz'
    ...
But in practice you're fairly unlikely to write a long enough sequence of code like that by hand, unless the strings involved are really huge. So just watch out for `+=` in loops (or recursive functions).
The reason why you may not see this result when you try to time it is that there is actually an optimization for string `+=` in the CPython interpreter. Let's go back to my silly example loop:
    mystring = ''
    for c in 'abcdefg' * 1000000:
        mystring += c
Every time this does `mystring = mystring + c`, the old value of `mystring` becomes garbage and is deleted, and the name `mystring` ends up referring to a newly created string that begins exactly with the contents of the old object. We could optimise this by recognising that `mystring` is about to become garbage, so we can do whatever we like to it without anybody caring. So even though strings are immutable at the Python level, at the implementation level we'll make them dynamically expandable, and we'll implement `target += source` by either doing the normal allocate-a-new-string-and-copy method, or by expanding the target string in place and copying only the `source` characters, depending on whether `target` would be made garbage.
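On CPython you can observe this indirectly with `id()` (a CPython-specific sketch; `id()` values are an implementation detail and none of this behaviour is guaranteed by the language):

    s = 'a' * 100
    before = id(s)
    s += 'b'                # s holds the only reference, so CPython can
    print(id(s) == before)  # often resize in place: usually True here

    s = 'a' * 100
    alias = s               # a second reference defeats the optimization
    before = id(s)
    s += 'b'
    print(id(s) == before)  # False: a new object had to be allocated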
The problem with this optimization is that it's easy to disrupt. It works absolutely fine on small self-contained loops (which are the easiest to convert to using `join`, by the way). But if you're doing something more complex and you ever accidentally end up with more than one reference to the strings, suddenly the code can run a lot slower.
Say you've got some logging calls in the loop, and the logging system buffers its messages for a while so as to print them all at once (should be safe; strings are immutable). The references to your strings within the logging system could stop the `+=` optimization being applicable.
Say you've written your loop as a recursive function (which Python doesn't really like anyway, but still) building up a string with `+=` for some reason. The outer stack frames will still have references to the old values.
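For instance (a hypothetical sketch; `build` is my own illustrative name, and deep recursion will also hit the recursion limit):

    def build(acc, remaining):
        if not remaining:
            return acc
        acc += remaining[0]   # the caller's frame still references the
                              # old string, so += can't reuse its buffer
        return build(acc, remaining[1:])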
Or maybe what you're doing with the strings is generating a series of objects, so you're passing them to a class; if the class stores the string directly in the instance, the optimization goes away, but if the class manipulates them first then the optimization still works.
Essentially, the performance of what looks like a really basic primitive operation is either okay or really bad, and which it is depends on code other than the code using `+=`. In extreme cases a change to a completely separate file (maybe even a third-party package) could introduce a massive performance regression in one of your modules that hasn't changed in a long time!
Plus, my understanding is that the `+=` optimization is only easy to implement on CPython, because it makes use of reference counting; you can easily tell when the target string is garbage by looking at its reference count, whereas with more sophisticated garbage collection you can't tell until you remove the reference and the garbage collector runs, which is far too late to decide how to implement `+=`. So again, really simple basic code that shouldn't have any issues being portable between Python implementations can suddenly run too slowly to be useful when you move it to another implementation.
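Reference counting itself is easy to see from Python on CPython (a quick illustration; `sys.getrefcount` counts its own argument as one extra reference, and other implementations may return dummy values):

    import sys

    s = 'x' * 100               # a computed string, so it isn't interned
    print(sys.getrefcount(s))   # typically 2: the name s plus the argument
    alias = s
    print(sys.getrefcount(s))   # typically 3: s, alias, and the argument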
And here's some benchmarking to show the scale of the problem:
    import timeit

    def plus_equals(data):
        s = ''
        for c in data:
            s += c

    def simple_join(data):
        s = ''.join(data)

    def append_join(data):
        l = []
        for c in data:
            l.append(c)
        s = ''.join(l)

    def plus_equals_non_garbage(data):
        s = ''
        for c in data:
            dummy = s   # second reference defeats the += optimization
            s += c

    def plus_equals_maybe_non_garbage(data):
        s = ''
        for i, c in enumerate(data):
            if i % 1000 == 0:
                dummy = s   # defeat the optimization once per thousand
            s += c

    def plus_equals_enumerate(data):
        s = ''
        for i, c in enumerate(data):
            if i % 1000 == -1:   # never true: controls for the overhead
                dummy = s        # of enumerate and the modulo test
            s += c

    data = ['abcdefg'] * 1000000

    for f in (
        plus_equals,
        simple_join,
        append_join,
        plus_equals_non_garbage,
        plus_equals_maybe_non_garbage,
        plus_equals_enumerate,
    ):
        print('{:30}{:20.15f}'.format(f.__name__, timeit.timeit(
            'm.{0.__name__}(m.data)'.format(f),
            setup='import __main__ as m',
            number=1,
        )))
On my system this prints:
    plus_equals                      0.066924095153809
    simple_join                      0.013648986816406
    append_join                      0.086287975311279
    plus_equals_non_garbage        540.663727998733521
    plus_equals_maybe_non_garbage    0.731688976287842
    plus_equals_enumerate            0.156824111938477
The optimization of `+=` works very well when it works (even beating the dumb `append_join` version by a little). My numbers suggest you might be able to optimize code by replacing `append` + `join` with `+=` in some cases, but the benefit is not worth the risk of some other future change accidentally triggering the blowout (and the benefit would most likely be vanishingly small if there was any other actual work going on in the loop; and if there isn't, then you should be using something like the `simple_join` version).
By comparing `plus_equals_maybe_non_garbage` to `plus_equals_enumerate` you can see that even if the optimization fails on only one in every thousand `+=` operations, there's still a roughly 5-fold loss of performance.
The optimization of `+=` is really only intended as a crutch to save people who aren't experienced Python programmers, or who are just quickly and lazily writing some scratch code. If you're thinking about what you're doing, you should be using `join`.
Summary: Using `+=` is fine for a fixed small number of concatenations. `join` is always better for building up strings in loops. In practice you may not see a huge improvement porting code from `+=` to `join`, due to the `+=` optimization. You should still use `join` anyway, because the optimization is unreliable and the difference when it fails to kick in can be enormous.