99

Since Python's strings are immutable, it's inefficient to edit them repeatedly in loops. How can I use a mutable data structure to implement string operations, so as to avoid making lots of temporary strings?

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
user2902773
  • 991
  • 1
  • 6
  • 4
  • 4
    You might get a similar effect by building a list of strings and using `join()` on it after the loop. But I'm sure there's a more pythonic way (probably involving list comprehension). – Joachim Sauer Nov 12 '13 at 10:02

9 Answers9

119

Python 3

From the docs:

Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below: if concatenating str objects, you can build a list and use str.join() at the end or else write to an io.StringIO instance and retrieve its value when complete

Experiment to compare runtime of several options:

import sys
import timeit
from io import StringIO
from array import array


def test_concat():
    out_str = ''
    for _ in range(loop_count):
        out_str += 'abc'
    return out_str


def test_join_list_loop():
    str_list = []
    for _ in range(loop_count):
        str_list.append('abc')
    return ''.join(str_list)


def test_array():
    char_array = array('b')
    for _ in range(loop_count):
        char_array.frombytes(b'abc')
    return str(char_array.tostring())


def test_string_io():
    file_str = StringIO()
    for _ in range(loop_count):
        file_str.write('abc')
    return file_str.getvalue()


def test_join_list_compr():
    return ''.join(['abc' for _ in range(loop_count)])


def test_join_gen_compr():
    return ''.join('abc' for _ in range(loop_count))


loop_count = 80000

print(sys.version)

res = {}

for k, v in dict(globals()).items():
    if k.startswith('test_'):
        res[k] = timeit.timeit(v, number=10)

for k, v in sorted(res.items(), key=lambda x: x[1]):
    print('{:.5f} {}'.format(v, k))

results

3.7.5 (default, Nov  1 2019, 02:16:32) 
[Clang 11.0.0 (clang-1100.0.33.8)]
0.03738 test_join_list_compr
0.05681 test_join_gen_compr
0.09425 test_string_io
0.09636 test_join_list_loop
0.11976 test_concat
0.19267 test_array

Python 2

Efficient String Concatenation in Python is a rather old article and its main statement that the naive concatenation is far slower than joining is not valid anymore, because this part has been optimized in CPython since then. From the docs:

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

I've adapted their code a bit and got the following results on my machine:

from cStringIO import StringIO
from UserString import MutableString
from array import array

import sys, timeit

def method1():
    out_str = ''
    for num in xrange(loop_count):
        out_str += `num`
    return out_str

def method2():
    out_str = MutableString()
    for num in xrange(loop_count):
        out_str += `num`
    return out_str

def method3():
    char_array = array('c')
    for num in xrange(loop_count):
        char_array.fromstring(`num`)
    return char_array.tostring()

def method4():
    str_list = []
    for num in xrange(loop_count):
        str_list.append(`num`)
    out_str = ''.join(str_list)
    return out_str

def method5():
    file_str = StringIO()
    for num in xrange(loop_count):
        file_str.write(`num`)
    out_str = file_str.getvalue()
    return out_str

def method6():
    out_str = ''.join([`num` for num in xrange(loop_count)])
    return out_str

def method7():
    out_str = ''.join(`num` for num in xrange(loop_count))
    return out_str


loop_count = 80000

print sys.version

print 'method1=', timeit.timeit(method1, number=10)
print 'method2=', timeit.timeit(method2, number=10)
print 'method3=', timeit.timeit(method3, number=10)
print 'method4=', timeit.timeit(method4, number=10)
print 'method5=', timeit.timeit(method5, number=10)
print 'method6=', timeit.timeit(method6, number=10)
print 'method7=', timeit.timeit(method7, number=10)

Results:

2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)]
method1= 0.171155929565
method2= 16.7158739567
method3= 0.420584917068
method4= 0.231794118881
method5= 0.323612928391
method6= 0.120429992676
method7= 0.145267963409

Conclusions:

  • join still wins over concat, but marginally
  • list comprehensions are faster than loops (when building a list)
  • joining generators is slower than joining lists
  • other methods are of no use (unless you're doing something special)
Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
georg
  • 211,518
  • 52
  • 313
  • 390
  • 3
    It may be worth nothing that the MutableString class has been deprecated in python 2.6 and removed entirely in Python 3. see [here](https://docs.python.org/2/library/userdict.html?highlight=mutablestring#UserString.MutableString) – Adam Oren Dec 25 '15 at 02:45
  • 2
    Warning! The statement that CPython optimizes this no longer applies in recent versions (v3.5-v3.8+). This has been replaced with a warning that concat'ing immutables in this way is _always_ quadratic: https://docs.python.org/3/library/stdtypes.html – jtschoonhoven Jul 04 '20 at 14:46
  • @jtschoonhoven: I've turned the post into CW, please edit your comment in. Thanks! – georg Jul 07 '20 at 09:14
  • This doesn't answer the question at all – Pod Oct 28 '22 at 14:15
15

Depends on what you want to do. If you want a mutable sequence, the builtin list type is your friend, and going from str to list and back is as simple as:

 mystring = "abcdef"
 mylist = list(mystring)
 mystring = "".join(mylist)

If you want to build a large string using a for loop, the pythonic way is usually to build a list of strings then join them together with the proper separator (linebreak or whatever).

Else you can also use some text template system, or a parser or whatever specialized tool is the most appropriate for the job.

bruno desthuilliers
  • 75,974
  • 6
  • 88
  • 118
  • Is the complexity of "".join(mylist) O(n) ? –  Sep 22 '15 at 00:42
  • @user2374515 **Yes,** the `str.join()` method is O(n) complexity. Per the [official documentation](http://docs.python.org/2/library/stdtypes.html): "For performance sensitive code, it is preferable to use the `str.join()` method which assures consistent **linear concatenation performance** across versions and implementations." – Cecil Curry Jun 15 '16 at 05:33
13

Perhaps use a bytearray:

In [1]: s = bytearray('Hello World')

In [2]: s[:5] = 'Bye'

In [3]: s
Out[3]: bytearray(b'Bye World')

In [4]: str(s)
Out[4]: 'Bye World'

The appeal of using a bytearray is its memory-efficiency and convenient syntax. It can also be faster than using a temporary list:

In [36]: %timeit s = list('Hello World'*1000); s[5500:6000] = 'Bye'; s = ''.join(s)
1000 loops, best of 3: 256 µs per loop

In [37]: %timeit s = bytearray('Hello World'*1000); s[5500:6000] = 'Bye'; str(s)
100000 loops, best of 3: 2.39 µs per loop

Note that much of the difference in speed is attributable to the creation of the container:

In [32]: %timeit s = list('Hello World'*1000)
10000 loops, best of 3: 115 µs per loop

In [33]: %timeit s = bytearray('Hello World'*1000)
1000000 loops, best of 3: 1.13 µs per loop
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • What encoding will this use? In Java similar constructs would be very problematic, because they use the platform default encoding which can be anything ... – Joachim Sauer Nov 12 '13 at 10:07
  • @JoachimSauer: Like a `str`, the encoding is up to you. As far as the `bytearray` is concerned, every value is just a byte. – unutbu Nov 12 '13 at 10:13
  • bytearray can be useful for really low-level stuff - as the name implies, it's really about "arrays of bytes", not "strings of characters". – bruno desthuilliers Nov 12 '13 at 10:53
  • "...but it is slower than using a temporary list." What is temporary list? Is it (Python's default) list, like `['s', 't', 'r', 'i', 'n', 'g']`? – fikr4n Feb 02 '14 at 22:41
  • @BornToCode: The temporary list would be `mylist` in [bruno desthuilliers' code](http://stackoverflow.com/a/19926210/190597). – unutbu Feb 02 '14 at 22:48
  • `bytearray` is unsuitable for variable-length encodings like UTF-8 (since replacing one character with another of a different byte length would require the rest of the data to shift), and awkward for fixed-multi-byte encodings like UTF-32 (of course, UTF-16 is both, thanks to surrogate pairs). – Karl Knechtel Mar 24 '23 at 22:49
11

The previously provided answers are almost always best. However, sometimes the string is built up across many method calls and/or loops, so it's not necessarily natural to build up a list of lines and then join them. And since there's no guarantee you are using CPython, or that CPython's optimization will apply, an alternative approach is to just use print!

Here's an example helper class, although the helper class is trivial and probably unnecessary, it serves to illustrate the approach (Python 3):

import io

class StringBuilder(object):

    def __init__(self):
        self._stringio = io.StringIO()
    
    def __str__(self):
        return self._stringio.getvalue()
    
    def append(self, *objects, sep=' ', end=''):
        print(*objects, sep=sep, end=end, file=self._stringio)

sb = StringBuilder()
sb.append('a')
sb.append('b', end='\n')
sb.append('c', 'd', sep=',', end='\n')
print(sb)  # 'ab\nc,d\n'
hoijui
  • 3,615
  • 2
  • 33
  • 41
rhaertel80
  • 8,254
  • 1
  • 31
  • 47
4

I've added to Roee Gavirel's code 2 additional tests that show conclusively that joining lists into strings is not any faster than s += "something", up to Python 3.6. Later versions have different results.

Results:

Python 2.7.15rc1    

Iterations: 100000
format    done in 0.317540168762s
%s        done in 0.151262044907s
list+join done in 0.0055148601532s
str cat   done in 0.00391721725464s
    
Python 3.6.7

Iterations: 100000
format    done in 0.35594654083251953s
%s        done in 0.2868080139160156s
list+join done in 0.005924701690673828s
str cat   done in 0.0054128170013427734s
f str     done in 0.12870001792907715s

Python 3.8.5

Iterations: 100000
format    done in 0.1859891414642334s
%s        done in 0.17499303817749023s
list+join done in 0.008001089096069336s
str cat   done in 0.014998912811279297s
f str     done in 0.1600024700164795s

Code:

from time import time


def _with_cat(i):
    _st = ''
    for i in range(0, i):
        _st += "0"
    return _st


def _with_f_str(i):
    _st = ''
    for i in range(0, i):
        _st = f"{_st}0"
    return _st


def _with_format(i):
    _st = ''
    for i in range(0, i):
        _st = "{}{}".format(_st, "0")
    return _st


def _with_s(i):
    _st = ''
    for i in range(0, i):
        _st = "%s%s" % (_st, "0")
    return _st


def _with_list(i):
    l = []
    for i in range(0, i):
        l.append("0")
    return "".join(l)


def _count_time(name, i, func):
    start = time()
    r = func(i)
    total = time() - start
    print("%s done in %ss" % (name, total))
    return r


iteration_count = 100000

print('Iterations: {}'.format(iteration_count))
r1 = _count_time("format   ", iteration_count, _with_format)
r2 = _count_time("%s       ", iteration_count, _with_s)
r3 = _count_time("list+join", iteration_count, _with_list)
r4 = _count_time("str cat  ", iteration_count, _with_cat)
r5 = _count_time("f str    ", iteration_count, _with_f_str)

if len(set([r1, r2, r3, r4, r5])) != 1:
    print("Not all results are the same!")
Martlark
  • 14,208
  • 13
  • 83
  • 99
  • 1
    Hooray and thank you from the department of "sometimes the simple way is the best way." – Lance Kind Aug 15 '19 at 22:28
  • With later versions of Python 3.8+ list + join is almost twice as fast as s+= – Martlark Nov 22 '21 at 21:31
  • The reason this works (to the extent that it does) is that the reference implementation - since it is using reference-counting for garbage collection - can easily check whether a string has any other references; if it's only used in one place, then modifying it in-place is safe. So Python simply "cheats" and implements a mutable string under the hood that simply denies mutation most of the time. When there are other references to the string, a copy must be made for correct semantics; but in a loop, after the first loop iteration the string *no longer has other references* (do you see why?). – Karl Knechtel Mar 24 '23 at 23:01
  • 1
    This is similar to how `+=` for lists can use the `.extend` logic and not create a new object even though `+` does. Of course, `list` is a mutable type, and `+=` is allowed to modify the list even if it does have other references. – Karl Knechtel Mar 24 '23 at 23:02
3

this link might be useful for concatenation in python

http://pythonadventures.wordpress.com/2010/09/27/stringbuilder/

example from above link:

def g():
    sb = []
    for i in range(30):
        sb.append("abcdefg"[i%7])

    return ''.join(sb)

print g()   

# abcdefgabcdefgabcdefgabcdefgab
Kamlesh Arya
  • 4,864
  • 3
  • 21
  • 28
  • Whilst this may theoretically answer the question, [it would be preferable](http://meta.stackexchange.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – Joachim Sauer Nov 12 '13 at 10:06
2

Just a test I run on python 3.6.2 showing that "join" still win BIG!

from time import time


def _with_format(i):
    _st = ''
    for i in range(0, i):
        _st = "{}{}".format(_st, "0")
    return _st


def _with_s(i):
    _st = ''
    for i in range(0, i):
        _st = "%s%s" % (_st, "0")
    return _st


def _with_list(i):
    l = []
    for i in range(0, i):
        l.append("0")
    return "".join(l)


def _count_time(name, i, func):
    start = time()
    r = func(i)
    total = time() - start
    print("%s done in %ss" % (name, total))
    return r

iterationCount = 1000000

r1 = _count_time("with format", iterationCount, _with_format)
r2 = _count_time("with s", iterationCount, _with_s)
r3 = _count_time("with list and join", iterationCount, _with_list)

if r1 != r2 or r2 != r3:
    print("Not all results are the same!")

And the output was:

with format done in 17.991968870162964s
with s done in 18.36879801750183s
with list and join done in 0.12142801284790039s
Roee Gavirel
  • 18,955
  • 12
  • 67
  • 94
1

The closest thing Python offers to a mutable string or StringBuffer would probably be a Unicode-type array from the array standard library module. It can be useful in cases where you only want to edit small parts of the string:

modifications = [(2, 3, 'h'), (0, 6, '!')]
n_rows = multiline_string.count('\n')
strarray = array.array('u', multiline_string)
for row, column, character in modifications:
    strarray[row * (n_rows + 1) + column] = character
multiline_string = map_strarray.tounicode()
Jan Šimbera
  • 457
  • 3
  • 16
0

Here is my StringBuffer implementation:

class StringBuffer:
    def __init__(self, s:str=None):
        self._a=[] if s is None else [s]

    def a(self, v):
        self._a.append(str(v))
        return self

    def al(self, v):
        self._a.append(str(v))
        self._a.append('\n')
        return self

    def ts(self, delim=''):
        return delim.join(self._a)

    def __bool__(self): return True

Usage:

sb = StringBuffer('{')
for i, (k, v) in enumerate({'k1':'v1', 'k2': 'v2'}.items()):
    if i > 0: sb.a(', ')
    sb.a('"').a(k).a('": ').a('"').a(v)
sb.a('}')
print(sb.ts('\n'))

Which will output {"k1": "v1, "k2": "v2}.

Timothy C. Quinn
  • 3,739
  • 1
  • 35
  • 47