-1

I'm analysing the complexity of string concatination for different methods and what I notice.

%%time
sent = ""
word = "hello "
num = 100000
for _ in range(num):
    sent += word

this method takes 20ms

%%time
sent_1 = "".join([word] * num)

the second method takes 4 ms and

%%time
sent_2 = ""
for _ in range(num//5):
    sent_2 = sent_2 + word + word + word + word + word

the third method takes 208 ms

sent==sent_1==sent_2 gives me True

can someone explain why I get such results

  • 2
    "can someone explain why I get such results" What needs explaining? What do you think the results should be like instead, and why? I notice you wrote "time complexity" in the title; did you try using different values of `num`, to assess the big-O complexity of each approach? – Karl Knechtel Mar 06 '23 at 20:01
  • Repeated string concatenation (using the `+` symbol) in Python is very slow. Using `.join()` is much faster. It's that simple. – John Gordon Mar 06 '23 at 20:06
  • Strictly speaking, `str` values are immutable. *But*, CPython has an optimization that detects that `sent + word` is being assigned back to `sent`, and so updates `sent` in-place anyway. – chepner Mar 06 '23 at 20:07
  • @KarlKnechtel I thought that for _ in range(num): sent += word should give the same time as for _ in range(num//5): sent_2 = sent_2 + word + word + word + word + word – Влад Лещук Mar 06 '23 at 20:09
  • Ah, then the question is basically about the special-case optimization of `+=` for strings. This is a common question, but I'm struggling to find a proper canonical. But see e.g. https://stackoverflow.com/questions/2791931. – Karl Knechtel Mar 06 '23 at 23:10

1 Answers1

1

sent2 + word + word + word + word + word creates multiple temporary str objects, roughly equivalent to

t1 = sent2 + word
t2 = t1 + word
t3 = t2 + word
t4 = t3 + word
sent_2= t4 + word

Creating all these objects (and copying the ever-growing values of one temporary value into another) results in a quadratic runtime.

''.join([word] * num), on the other hand, gets all the strings to concatenate at once in a single list, so it can allocate one result string and copy the substring in place in linear time.

The first example,

for _ in range(num):
    sent += word

should be equivalent to repeated concatenation. But, CPython in particular optimizes this by recognizing that nobody can access the value of sent while sent + word is being computed, so it "cheats" and updates sent in place, with no new string being allocated to assign back to sent. This means you still have a linear-time operation, like ''.join, but with extra overhead due to iterating over the string in Python rather than at the C level of the interpreter.

chepner
  • 497,756
  • 71
  • 530
  • 681