3

Today I was trying to find a way to do some processing on strings in Python. A more senior programmer than I am said not to use `+=` but to use `''.join()`, and I could also read this in e.g. http://wiki.python.org/moin/PythonSpeed/#Use_the_best_algorithms_and_fastest_tools . But I tested this myself and found somewhat strange results (it's not that I'm trying to second-guess them, I just want to understand). The idea was that if there is a string "This is \"an example text\" containing spaces", it should be converted to Thisis"an example text"containingspaces: the spaces are removed, but only outside the quotes.

I measured the performance of two different versions of my algorithm: one using `''.join(list)` and one using `+=`.

import time

#uses '+=' operator
def strip_spaces ( s ):
    ret_val = ""
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found

        if i == ' ' and quote_found == True:
            ret_val += i

        if i != ' ':
            ret_val += i
    return ret_val

#uses "".join ()   
def strip_spaces_join ( s ):
    #ret_val = ""
    ret_val = []
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found

        if i == ' ' and quote_found == True:
            #ret_val = ''.join( (ret_val, i) )
            ret_val.append(i)

        if i != ' ':
            #ret_val = ''.join( (ret_val,i) )
            ret_val.append(i)
    return ''.join(ret_val)


def time_function ( function, data):
    time1 = time.time();
    function(data)
    time2 = time.time()
    print "it took about {0} seconds".format(time2-time1)

On my machine, timing with time_function showed a minor advantage for the algorithm using +=. Following the advice in the comments, I also timed both versions with the timeit module:

print '#using += yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces as f', number=1000)
print '#using \'\'.join() yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces_join as f', number=1000)

which gives:

#using += yields  0.0130770206451
#using ''.join() yields  0.0108470916748

The difference is really minor. Why is ''.join() not clearly outperforming the function that uses +=? There seems to be only a small advantage for the ''.join() version. I tested this on Ubuntu 12.04 with Python 2.7.3.

hetepeperfan
  • When testing timings, use the `timeit` module. – Martijn Pieters May 27 '13 at 21:58
  • For strings larger than about 400 characters, using a list with `join` is faster. – Blender May 27 '13 at 22:02
  • @Blender the strings are of length 43000000 in the end, or am I misinterpreting you? – hetepeperfan May 27 '13 at 22:05
  • You aren't really comparing the same things here either. You're comparing multiple `+=` vs. multiple `append`s followed by a single `''.join()`. – Jeff Tratner May 27 '13 at 22:05
  • @hetepeperfan: You're performing only one trial for each string. That's not really enough to draw a conclusion from. – Blender May 27 '13 at 22:06
  • @Blender as I noted, I expected ''.join() to outperform +=, say by being at least 2 times as fast. – hetepeperfan May 27 '13 at 22:07
  • @MartijnPieters I see I should use the timeit module ;-) – hetepeperfan May 27 '13 at 22:08
  • Updated to use the timeit module – hetepeperfan May 27 '13 at 22:36
  • I feel I should mention: it's not all about performance. Readability is important too (as well as being able to debug and reason about the code), `.join` wins every time here. – Andy Hayden May 27 '13 at 23:13
  • @AndyHayden I agree with you, readability counts. But that is also why I was really amazed while learning about `''.join()`: you really have to call `.join()` on an empty string, which is why I prefer `string += string` over `''.join((string, string))` if we are talking about readability. Perhaps I must get used to that part of Pythonic syntax :). – hetepeperfan May 27 '13 at 23:18

4 Answers

8

Do use the correct methodology when comparing algorithms; use the timeit module to eliminate fluctuations in CPU utilization and swapping.

Using timeit shows there is little difference between the two approaches, but ''.join() is slightly faster:

>>> s = 1000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
1.3209099769592285
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
1.2893600463867188
>>> s = 10000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
14.545105934143066
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
14.43651008605957

Most of the work in your functions is the looping over each and every character and testing for quotes and spaces, not string concatenation itself. Moreover, the ''.join() variant does more work; you are appending the elements to a list first (this replaces the += string concatenation operations), then you are concatenating these values at the end using ''.join(). And that method is still ever so slightly faster.

You may want to strip back the work being done to compare just the concatenation part:

def inplace_add_concatenation(s):
    res = ''
    for c in s:
        res += c

def str_join_concatenation(s):
    ''.join(s)

which shows:

>>> s = list(1000 * string)
>>> timeit.timeit('f(s)', 'from __main__ import s, inplace_add_concatenation as f', number=1000)
6.113742113113403
>>> timeit.timeit('f(s)', 'from __main__ import s, str_join_concatenation as f', number=1000)
0.6616439819335938

This shows ''.join() concatenation is still a heck of a lot faster than +=. The speed difference lies in the loop; s is a list in both cases, but ''.join() loops over the values in C, while the other version has to do all its looping in Python. And that makes all the difference here.

Martijn Pieters
  • There was just another answer that promoted `+=`; does this also apply to the `inplace_add_concatenation` in your answer? Because in that case I would definitely (still naive) have written `res += s` and not iterated over all the single characters, instead of using ''.join(). – hetepeperfan May 27 '13 at 22:40
  • I am not sure I follow you. – Martijn Pieters May 27 '13 at 22:44
  • You've written `def inplace_add_concatenation(s): res = ''; for c in s: res += c`. But here you iterate over all characters in the string instead of doing `res += s`, which doesn't iterate. I'll get the message anyway, though. – hetepeperfan May 27 '13 at 22:48
  • @hetepeperfan: Yes, because that is basically what you have to do to concatenate all elements of sequence `s` into one string. That is one of the reasons why `''.join(s)` is *fast*; it doesn't have to do such looping (it does that in C instead). But your sample function from your question has to loop *anyway*, and thus `''.join()` cannot win; all the work you do **requires** you to loop over the input characters one by one. You are thus not able to take advantage of `''.join()`. – Martijn Pieters May 27 '13 at 22:50
  • Also, `+=` is very susceptible to typos and/or confusing code... *shudders*. – Andy Hayden May 27 '13 at 23:00
  • @AndyHayden I guess this last remark is a bit of taste otherwise isn't this the end of operator overloading? – hetepeperfan May 27 '13 at 23:04
  • @hetepeperfan I think operator overloading is very useful, and you can use it without the `+=` idiom. I agree it's a matter of taste though... – Andy Hayden May 27 '13 at 23:09
3

Another option is to write a function which joins using a generator, rather than appending to a list each time.

For example:

def strip_spaces_gen(s):
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        if i == ' ' and quote_found == True:
            # Note: you (c|sh)ould drop the == True, but I'll leave it here so as to not give an unfair advantage over the other functions
            yield i
        if i != ' ':
            yield i

def strip_spaces_join_gen(ing):
     return ''.join(strip_spaces_gen(ing))
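
For example, applied to the example string from the question, this gives the expected result:

s = "This is \"an example text\" containing spaces"
print strip_spaces_join_gen(s)   # prints: Thisis"an example text"containingspaces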

This appears to be about the same (as a join) for a shorter string:

In [20]: s = "This is \"an example text\" containing spaces"

In [21]: %timeit strip_spaces_join_gen(s)
10000 loops, best of 3: 22 us per loop

In [22]: %timeit strip_spaces(s)
100000 loops, best of 3: 13.8 us per loop

In [23]: %timeit strip_spaces_join(s)
10000 loops, best of 3: 23.1 us per loop

But faster for larger strings.

In [24]: s = s * 1000

In [25]: %timeit strip_spaces_join_gen(s)
100 loops, best of 3: 12.9 ms per loop

In [26]: %timeit strip_spaces(s)
100 loops, best of 3: 17.1 ms per loop

In [27]: %timeit strip_spaces_join(s)
100 loops, best of 3: 17.5 ms per loop
Andy Hayden
  • What's with the `boolean_name == True`? Drop the `== True` and you gain performance as you don't need to test the obvious.. – Martijn Pieters May 27 '13 at 22:43
  • @MartijnPieters copy and pasting the OP's source is my excuse :) (also I think there may be some other gains to be had, but thought it would be unfair to distort that bit of code, since we are really testing the generator/join bit) – Andy Hayden May 27 '13 at 22:47
  • Yeah, I agree there are other things that could be done more efficiently. And I admit to being a little unfair about the `== True` part, I did see the OP make that mistake originally. :-) – Martijn Pieters May 27 '13 at 22:51
  • And the reason the generator function is faster for longer strings is that by that time you start beating out the `list.append()` calls which have to grow a list element by element. Internally, `''.join()` still has to build a list, but can do that in C instead. – Martijn Pieters May 27 '13 at 22:54
3

(This may be lots of detail the OP is already aware of, but going through the issue in full could help others who end up on this question)

The problem with mystring += suffix is that strings are immutable, so this is actually equivalent to mystring = mystring + suffix. So the implementation has to create a new string object, copy all of the characters from mystring over to it, and then copy all the characters from suffix after that. Then the mystring name is rebound to refer to the new string; the original string object that was referred to by mystring is untouched.
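
A small illustration of that last point; a second name bound to the old string still refers to the old, untouched object:

mystring = 'abc'
other = mystring    # a second reference to the same string object
mystring += 'def'   # builds a new string and rebinds the name mystring
print mystring      # abcdef
print other         # abc -- the original string object is untouched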

On its own, that's not actually a problem. Any method of concatenating these two strings has to do this, including ''.join([mystring, suffix]); this is actually worse, because it has to first construct a list object and then iterate over it, and while there's no actual data transfer in splicing the empty string between mystring and suffix, that'll take at least one instruction to sort out.
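
If you want to see that overhead in isolation, a rough timeit comparison along these lines (a minimal sketch; exact numbers will vary by machine) should show plain + ahead of ''.join() for a single pair of strings:

import timeit

# a single two-string concatenation: plain + versus ''.join() with its list building
print timeit.timeit("a + b", setup="a = 'x' * 100; b = 'y' * 100")
print timeit.timeit("''.join([a, b])", setup="a = 'x' * 100; b = 'y' * 100")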

Where += becomes a problem is when you do it repeatedly. Something like this:

mystring = ''
for c in 'abcdefg' * 1000000:
    mystring += c

Remember that mystring += c is equivalent to mystring = mystring + c. So on the first iteration of the loop it evaluates '' + 'a' copying 1 character total. Next it does 'a' + 'b' copying 2 characters total. Then 'ab' + 'c' for 3 characters, then 'abc' + 'd' for 4, and I think you can see where this is going. Every subsequent += is repeating all of the work of the previous one, and then copying the new string as well. This gets extremely wasteful.
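
To put a number on "extremely wasteful": if every += really did a full copy, the total number of characters copied grows quadratically. A back-of-the-envelope count for the loop above:

n = len('abcdefg' * 1000000)     # 7,000,000 characters appended one at a time
total_copied = n * (n + 1) // 2  # 1 + 2 + ... + n characters copied in total
print total_copied               # 24500003500000, roughly 2.45e13 character copies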

''.join(...) is better because there you wait until you know all of the strings before copying any of them, and then copy each one directly to its correct place in the final string object. Contrary to what some of the comments and answers have said, this remains the case even if you have to modify your loop to append the strings to a list of strings, and then join them after the loop. Lists are not immutable, so appending to a list modifies it in place, and it also only needs to append a single reference rather than copying all the characters in a string. Doing thousands of appends to a list is much faster than doing thousands of string += operations.
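
In code, that is the usual append-then-join pattern this paragraph describes (a minimal sketch):

parts = []
for c in 'abcdefg' * 1000000:
    parts.append(c)           # appends one reference; no characters are copied here
mystring = ''.join(parts)     # each character is copied exactly once, into its final place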

Repeated string += is theoretically a problem even without loops, if you just write your source code something like:

s = 'foo'
s += 'bar'
s += 'baz'
...

But in practice you're fairly unlikely to write a long enough sequence of code like that by hand, unless the strings involved are really huge. So just watch out for += in loops (or recursive functions).


The reason why you may not see this result when you try to time it is that there is actually an optimization for string += in the CPython interpreter. Let's go back to my silly example loop:

mystring = ''
for c in 'abcdefg' * 1000000:
    mystring += c

Every time this does mystring = mystring + c, the old value of mystring becomes garbage and is deleted, and the name mystring ends up referring to a newly created string that begins exactly with the contents of the old object. We could optimise this by recognising that mystring is about to become garbage, so we can do whatever we like to it without anybody caring. So even though strings are immutable at the Python level, at the implementation level we'll make them dynamically expandable, and we'll implement target += source either by the normal allocate-a-new-string-and-copy method, or by expanding the target string and copying only the source characters, depending on whether target would become garbage.

The problem with this optimization is that it's easy to disrupt. It works absolutely fine on small self-contained loops (which are the easiest to convert to using join, by the way). But if you're doing something more complex and you ever accidentally end up with more than one reference to the strings, suddenly the code can run a lot slower.

Say you've got some logging calls in the loop, and the logging system buffers its messages for a while so as to print them all at once (should be safe; strings are immutable). The references to your strings within the logging system could stop the += optimization being applicable.

Say you've written your loop as a recursive function (which Python doesn't really like anyway, but still) building up a string with += for some reason. The outer stack frames will still have references to the old values.

Or maybe what you're doing with the strings is generating a series of objects, so you're passing them to a class; if the class stores the string directly in the instance, the optimization goes away, but if the class manipulates them first then the optimization still works.

Essentially, the performance of what looks like a really basic primitive operation is either okay or really bad, and which it is depends on other code than the code using +=. In extreme cases you could have a change to a completely separate file (maybe even a third party package) introduce a massive performance regression in one of your modules that hasn't changed in a long time!

Plus, my understanding is that the += optimization is only easy to implement on CPython, because it makes use of reference counting; you can easily tell when the target string is garbage by looking at its reference count, whereas with more sophisticated garbage collection you can't tell until you remove the reference and wait until the garbage collector runs; far too late to make a decision about how to implement +=. So again, really simple basic code that shouldn't have any issues being portable between Python implementations can suddenly run too slowly to be useful when you move it to another implementation.


And here's some benchmarking to show the scale of the problem:

import timeit

def plus_equals(data):
    s = ''
    for c in data:
        s += c

def simple_join(data):
    s = ''.join(data)

def append_join(data):
    l = []
    for c in data:
        l.append(c)
    s = ''.join(l)

def plus_equals_non_garbage(data):
    s = ''
    for c in data:
        dummy = s
        s += c

def plus_equals_maybe_non_garbage(data):
    s = ''
    for i, c in enumerate(data):
        if i % 1000 == 0:
            dummy = s
        s += c

def plus_equals_enumerate(data):
    s = ''
    for i, c in enumerate(data):
        if i % 1000 == -1:
            dummy = s
        s += c

data = ['abcdefg'] * 1000000

for f in (
    plus_equals,
    simple_join,
    append_join,
    plus_equals_non_garbage,
    plus_equals_maybe_non_garbage,
    plus_equals_enumerate,
  ):
    print '{:30}{:20.15f}'.format(f.__name__, timeit.timeit(
        'm.{0.__name__}(m.data)'.format(f),
        setup='import __main__ as m',
        number=1
    ))

On my system this prints:

plus_equals                      0.066924095153809
simple_join                      0.013648986816406
append_join                      0.086287975311279
plus_equals_non_garbage        540.663727998733521
plus_equals_maybe_non_garbage    0.731688976287842
plus_equals_enumerate            0.156824111938477

The optimization of += works very well when it works (even beating the dumb append_join version by a little). My numbers suggest you might be able to optimize code by replacing append + join with += in some cases, but the benefit is not worth the risk of some other future change accidentally triggering the blowout (and the benefit would most likely be vanishingly small anyway if there were any other actual work going on in the loop; and if there isn't, then you should be using something like the simple_join version).

By comparing plus_equals_maybe_non_garbage to plus_equals_enumerate you can see that even if the optimization only fails one in every thousand += operations, there's still a 5-fold loss of performance.

The optimization of += is really only intended as a crutch to save people who aren't experienced Python programmers, or who are just quickly and lazily writing some scratch code. If you're thinking about what you're doing, you should be using join.


Summary: Using += is fine for a fixed small number of concatenations. join is always better for using loops to build up strings. In practice you may not see a huge improvement porting code from += to join, due to the += optimization. You should still use join anyway, because the optimization is unreliable and the difference when it fails to kick in can be enormous.

Ben
  • Thanks for another elaborate answer :) I agree that I used `+=` in many cases simply because I wasn't aware of `''.join()`; I used `+=` because I had some experience with other object-oriented languages, e.g. C++. Although I'm aware that the same problems might arise in C++ too. – hetepeperfan May 28 '13 at 08:02
1

The difference in performance between += and .join depends on a lot of factors:

  1. The operating system. Running this for increasingly larger strings on Unix-like or Windows systems can give quite different results. Typically, you will see a much more pronounced increase in run times under Windows.

  2. The Python implementation. By default we talk about CPython, but there are other implementations such as Jython or PyPy. Let's have a look at PyPy, using the source code from the answer above (a sketch of a possible driver script follows below the timings):

    CPython 2.7:
    python concat.py 
    inplace_add_concatenation: 0.420897960663
    str_join_concatenation:    0.061793088913
    ratio: 6.81140833169
    
    PyPy 1.9:
    pypy concat.py 
    inplace_add_concatenation: 1.26573014259
    str_join_concatenation:    0.0392870903015
    ratio: 32.2174570038
    

Even though PyPy is famous for its speed-ups compared to CPython, it is slower for the `+=` version. It was a deliberate decision not to include the `+=` optimization in the default version of PyPy.
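
concat.py itself is not shown above; a minimal sketch of such a driver, reusing the two functions from the answer above, could look like this (the input size and repetition count are assumptions, so the absolute timings will differ):

# concat.py -- hypothetical driver for the CPython vs. PyPy comparison above
import timeit

def inplace_add_concatenation(s):
    res = ''
    for c in s:
        res += c

def str_join_concatenation(s):
    ''.join(s)

data = list('This is "an example text" containing spaces' * 1000)

t_add = timeit.timeit('f(data)',
                      'from __main__ import data, inplace_add_concatenation as f',
                      number=1000)
t_join = timeit.timeit('f(data)',
                       'from __main__ import data, str_join_concatenation as f',
                       number=1000)

print 'inplace_add_concatenation:', t_add
print 'str_join_concatenation:   ', t_join
print 'ratio:', t_add / t_join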

Rule of thumb dealing with performance: "Never guess, always measure."

Also reading the docs helps:

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

from http://docs.python.org/2/library/stdtypes.html#typesseq

Mike Müller
  • +1 for "Never guess, always measure." Performance critical components live and die on that principle. – Nisan.H May 27 '13 at 23:21