I have to concatenate many short strings into on a dictionary entry and I realized it was quite slow (roughly quadratic). However, if the strings are concatenated beforehand and then added to the dictionary then the concatenation time is almost linear.
Here is a simple illustration. This functions basically concatenate many strings and create a dictionary with a single entry that contains the concatenated string. The first functions does it directly on the entry "d[key] += str(num)" and the second on a string "out_str += str(num)". Down below the times are printed and can be seen the quadratic and linear behaviour.
I wonder where the overhead is coming from. Thanks!
def method1(loop_count):
""" String concatenation on dictionary entry directly (slow)"""
out_str = ''
d={}
key = 'key'
d[key] = ''
for num in range(loop_count):
d[key] += str(num)
return d
def method2(loop_count):
""" Concatenation 'on string' and then add to dictionary (fast) """
out_str = ''
d={}
key = 'key'
for num in range(loop_count):
out_str += str(num)
d[key] = out_str
return d
def method3(loop_count):
""" Concatenation 'on string' and then add to dictionary (fast) """
out_str = ''
d={}
key = 'key'
out_str = ''.join(str(n) for n in range(loop_count))
d[key] = out_str
return d
from timeit import default_timer as timer
import numpy as np
for p in range(10,20):
t0 = timer()
method1(np.power(2,p))
t1 = timer()
method2(np.power(2,p))
t2 = timer()
method3(np.power(2,p))
t3 = timer()
print("2^{}:\t{:4.2g}\t{:4.2g}\t{:4.2g}".format(p, t1-t0, t2-t1, t3-t2))
in dict += join
2^10: 0.0003 0.0002 0.0002
2^11: 0.00069 0.0004 0.00038
2^12: 0.0017 0.00079 0.00076
2^13: 0.0057 0.0016 0.0015
2^14: 0.021 0.0032 0.0031
2^15: 0.095 0.0065 0.0065
2^16: 0.77 0.013 0.013
2^17: 3.2 0.026 0.027
2^18: 15 0.052 0.052
2^19: 67 0.1 0.11
Note: This is not a question strictly about efficient string concatenation. It is about when string optimizations take place in string concatenation.
Note 2: I added a third method using the "join" idiom and it takes exactly the same time as +=
As the answers suggest this seems to be an optimization issue. Further tests seem to prove this. The code below shows that += reuses many of the strings whether the concatenation in dictionary entry does not:
a=''
for n in range(10):
print(id(a))
a+=str(n)
140126222965424
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126043294720
140126042796464
140126042796464
d={}
d['key']=''
for n in range(10):
print(id(d['key']))
d['key']+=str(n)
140126222965424
140126042643120
140126042643232
140126042643176
140126042643120
140126042643232
140126042643176
140126042643120
140126042761520
140126042761456
I still wonder why is so. Thank you!