
I have to concatenate many short strings onto a dictionary entry, and I realized it was quite slow (roughly quadratic). However, if the strings are concatenated beforehand and then assigned to the dictionary, the concatenation time is almost linear.

Here is a simple illustration. These functions concatenate many strings and create a dictionary with a single entry that contains the concatenated string. The first function does it directly on the entry (`d[key] += str(num)`) and the second on a string (`out_str += str(num)`). The times printed below show the quadratic and the linear behaviour.

I wonder where the overhead is coming from. Thanks!

def method1(loop_count):
    """String concatenation on the dictionary entry directly (slow)."""
    d = {}
    key = 'key'
    d[key] = ''
    for num in range(loop_count):
        d[key] += str(num)
    return d

def method2(loop_count):
    """Concatenation on a string, then assign to the dictionary (fast)."""
    out_str = ''
    d = {}
    key = 'key'
    for num in range(loop_count):
        out_str += str(num)

    d[key] = out_str
    return d

def method3(loop_count):
    """Concatenation with ''.join, then assign to the dictionary (fast)."""
    d = {}
    key = 'key'

    out_str = ''.join(str(n) for n in range(loop_count))

    d[key] = out_str
    return d

from timeit import default_timer as timer

for p in range(10, 20):
    t0 = timer()
    method1(2 ** p)
    t1 = timer()
    method2(2 ** p)
    t2 = timer()
    method3(2 ** p)
    t3 = timer()
    print("2^{}:\t{:4.2g}\t{:4.2g}\t{:4.2g}".format(p, t1 - t0, t2 - t1, t3 - t2))

        in dict   +=    join
2^10:   0.0003  0.0002  0.0002
2^11:   0.00069 0.0004  0.00038
2^12:   0.0017  0.00079 0.00076
2^13:   0.0057  0.0016  0.0015
2^14:   0.021   0.0032  0.0031
2^15:   0.095   0.0065  0.0065
2^16:   0.77    0.013   0.013
2^17:    3.2    0.026   0.027
2^18:     15    0.052   0.052
2^19:     67     0.1    0.11

Note: This is not strictly a question about efficient string concatenation. It is about when string optimizations take place during concatenation.

Note 2: I added a third method using the `''.join` idiom; it takes exactly the same time as `+=`.


As the answers suggest, this seems to be an optimization issue. Further tests seem to prove this. The code below shows that `+=` reuses many of the strings, whereas concatenation on a dictionary entry does not:

a = ''
for n in range(10):
    print(id(a))
    a += str(n)

140126222965424
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126042796464
140126042796464

d = {}
d['key'] = ''
for n in range(10):
    print(id(d['key']))
    d['key'] += str(n)

140126222965424
140126042643120
140126042643232
140126042643176
140126042643120
140126042643232
140126042643176
140126042643120
140126042761520
140126042761456

I still wonder why this is so. Thank you!

Gonzalo
  • Python optimizes some string concatenations. My guess is that "some" does not include `+=` on dictionary items. BTW, you might find it even faster to accumulate them in a list and then join them from there (you'd have to try it; see the sketch after these comments). – kindall Oct 31 '18 at 16:53
  • Possible duplicate of [How to speed up string concatenation in Python?](https://stackoverflow.com/questions/4289777/how-to-speed-up-string-concatenation-in-python) – E.Coms Oct 31 '18 at 16:57
  • Yes, because you should assume that string concatenation with `+=` is quadratic. There has been some effort at adding optimizations that will not use quadratic-time algorithms under the hood when you do a loop with `+=` with strings, but it is hard for the interpreter to optimize anything but the most obvious instances. In your case, the interpreter is not able to optimize it. But you shouldn't rely on interpreter optimizations, rather, use the *canonical* way of joining many strings: use `''.join` – juanpa.arrivillaga Oct 31 '18 at 19:31
  • Updated with join. The same performance as `+=`. – Gonzalo Nov 02 '18 at 10:06
  • So it seems that the issue is an optimization that does not kick in with dictionaries. I suspected something like that. Thank you! I wonder if there is any Python documentation where this issue is described... – Gonzalo Nov 02 '18 at 10:09
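
A minimal sketch of kindall's list-accumulation suggestion (the function name `method4` is mine, following the numbering above); it sidesteps the `+=` fast path entirely, since the list holds the pieces and the string is built once:

def method4(loop_count):
    """Collect the pieces in a list, join once, then store in the dict."""
    d = {}
    parts = []
    for num in range(loop_count):
        parts.append(str(num))
    d['key'] = ''.join(parts)
    return d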

1 Answer


"E.Coms" linked article points to an old mailing list entry: https://mail.python.org/pipermail/python-dev/2004-August/046686.html which talks about code like:

s = ''
for x in y:
    s += some_string(x)

mentioning:

The question is important because the difference in performance is enormous -- we are not talking about 2x or even 10x faster but roughly Nx faster where N is the size of the input data set.

Interesting, as I'd assumed that

out_str = ''.join(str(n) for n in range(loop_count))

would have been faster than

out_str = ''
for num in range(loop_count):
    out_str += str(num)

but they have the same performance as far as I can measure. I think I made this assumption because of the repeated messages that `str.join()` is a "good thing".

Not sure if that answers the question, but I found the history interesting!

As indicated by juanpa.arrivillaga in the comments below, method1 is slow because storing a reference to the string elsewhere (here, in the dictionary) invalidates the above optimisation. There is also an O(n) cost over the whole call for the extra dictionary lookups, but in these timings that small cost is dominated by the O(n^2) work of creating n copies of the string rather than the amortised O(n) version the optimisation permits.
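
For example, this minimal sketch (my addition, CPython-specific; exact id values will differ between runs) extends the id() experiment in the question: merely holding a second reference to the string, as the dictionary entry does, is enough to defeat the in-place resize:

a = ''
alias = None
for n in range(10):
    alias = a      # keep a second reference to the current string
    print(id(a))   # the id no longer stays stable as in the plain += run
    a += str(n)    # refcount > 1, so CPython cannot resize in place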

Sam Mason
  • Yes, because the interpreter is able to optimize that simple loop. Adding the indirection of the dictionary makes the interpreter unable to optimize it (see the bytecode sketch after these comments). But you should never write code that relies on these interpreter optimizations, which were really put into place because too many people cannot be bothered to learn the correct way to do things. So if you are joining many strings, use `''.join`. – juanpa.arrivillaga Oct 31 '18 at 19:35
  • @juanpa.arrivillaga have updated my answer, hope I explained it usefully – Sam Mason Oct 31 '18 at 19:44
  • "O(n) cost to doing the extra dictionary lookups as well" um, I don't think so. What do you mean? dict lookups in Python are safely assumed to be O(1). Of course, it is possible that certain dictionary structures will fail at this, but it is one of the most optimized data-structures in the language, especially for anything like `str` and `int` objects. Note, *many variable references, like global ones* are *simply `dict` accesses* – juanpa.arrivillaga Oct 31 '18 at 19:44
  • @juanpa.arrivillaga I meant it'll involve O(n) cost for the whole `method1` call, a hash lookup is obviously amortised O(1) – Sam Mason Oct 31 '18 at 19:45
  • I'm not sure what you mean by adding an O(N) cost exactly. It's adding a constant factor to the work being done inside the loop. So, suppose your loop is O(N), it would add something like O(cN) ~ O(N). But this stuff always confuses me, and it's been a while since I've considered it rigorously. – juanpa.arrivillaga Oct 31 '18 at 19:47
  • I think we're saying the same thing just using different words! If you've read Knuth's TAOCP I think I might be going into that level of detail — basically unnecessary for answering the OP – Sam Mason Oct 31 '18 at 19:53
  • Agreed, speaking informally, one scales linearly, the other quadratically. – juanpa.arrivillaga Oct 31 '18 at 19:54
  • The issue that juanpa.arrivillaga raises is very interesting. Should a programmer rely or not on under-the-hood optimizations? I agree that in general one should write code as optimized as possible without relying on optimizations. However, I am not sure for this case. String concatenation should be linear. The fact that in Python it is/was quadratic comes in part from immutable strings and other design decisions that shouldn't have affected its performance. I see the join idiom as a workaround. In this case, once you get the ref from the dict, the interpreter should understand that... – Gonzalo Nov 02 '18 at 10:22
  • ...should understand that a string is being expanded as with `+=` and should apply, not an optimization, but the normal linear way of extending strings. I guess however that this goes deep into the design decisions of Python. – Gonzalo Nov 02 '18 at 10:25
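
To make the indirection juanpa.arrivillaga describes concrete, here is a small sketch (my addition, not part of the original exchange; exact opcodes vary by CPython version) that disassembles both forms. The plain `+=` is followed by a simple store of the same local, which the interpreter can special-case; the dictionary version ends in a subscript store, and the dict keeps its own reference during the add:

import dis

def local_concat(pieces):
    # += followed by a plain store of the same local: CPython's fast
    # path can temporarily drop the other reference and resize in place
    s = ''
    for p in pieces:
        s += p
    return s

def dict_concat(pieces):
    # += on d['key'] compiles to a subscript load, the add, then a
    # subscript store; the dict holds a reference throughout, so the
    # string's refcount is never 1 and a new string is built each time
    d = {'key': ''}
    for p in pieces:
        d['key'] += p
    return d

dis.dis(local_concat)
dis.dis(dict_concat)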