14

Building a string through repeated string concatenation is an anti-pattern, but I'm still curious why its performance switches from linear to quadratic after string length exceeds approximately 10 ** 6:

# this will take time linear in n with the optimization
# and quadratic time without the optimization
import time

n = 10 ** 7  # set n to the desired final string length

start = time.perf_counter()
s = ''
for i in range(n):
    s += 'a'
total_time = time.perf_counter() - start
time_per_iteration = total_time / n
print(time_per_iteration)

For example, on my machine (Windows 10, python 3.6.1):

  • for 10 ** 4 < n < 10 ** 6, the time_per_iteration is almost perfectly constant at 170±10 µs
  • for 10 ** 6 < n, the time_per_iteration is almost perfectly linear, reaching 520 µs at n == 10 ** 7.

Linear growth in time_per_iteration is equivalent to quadratic growth in total_time.

The linear complexity results from the optimization in recent CPython versions (2.4+) that reuses the original storage if no other references to the string remain. But I expected the linear performance to continue indefinitely rather than switch to quadratic at some point.
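
For reference, here is a minimal sketch of my own showing that the "no other references" condition matters (CPython-specific; the keep_alias flag and exact timings are just illustrative). Keeping an extra alias to s prevents the in-place resize, so every iteration has to copy the whole string and the quadratic behaviour shows up even for small n:

import time

def concat(n, keep_alias=False):
    # Append one character at a time; optionally hold a second
    # reference to s, which prevents CPython from resizing it in place.
    s = ''
    alias = None
    start = time.perf_counter()
    for _ in range(n):
        if keep_alias:
            alias = s  # extra reference to s -> CPython must copy on +=
        s += 'a'
    return time.perf_counter() - start

for n in (10 ** 4, 10 ** 5):
    print(n, concat(n), concat(n, keep_alias=True))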

My question is based on this comment. For some odd reason, running

python -m timeit -s"s=''" "for i in range(10**7):s+='a'"

takes an incredibly long time (much longer than quadratic would suggest), so I never got actual timing results out of timeit. Instead, I used the simple loop above to obtain performance numbers.

Update:

My question might as well have been titled "How can a list-like append have O(1) performance without over-allocation?". From observing constant time_per_iteration for small strings, I assumed the string optimization must be over-allocating. But realloc is (unexpectedly to me) quite successful at avoiding a memory copy when extending small memory blocks.

max
  • 49,282
  • 56
  • 208
  • 355
  • Use the full timeit module so you can control the `number` of times the code is executed: `timeit.timeit('for i in range(10**7): s += "a"', 's=""', number=1)` I believe the default number is 1e6, so that is why it's not terminating – juanpa.arrivillaga Jun 11 '17 at 19:02
  • Also, for what it's worth, that took about 1.1 seconds on my machine, and switching to 10**8 took about 12 seconds, so it seems that on my Python 3.5.2 the optimization still holds for big numbers. – juanpa.arrivillaga Jun 11 '17 at 19:06
  • @juanpa.arrivillaga the default when called from command line (which is what I did) should be auto-determined based on how long each execution takes. And I'm running python 3.6.1, so something is weird. Can you try my loop with `time.perf_counter()` instead of `timeit`? And what's your platform? – max Jun 11 '17 at 19:14
  • Yep, I tried it. I'm getting linear performance. Given the answers below, perhaps the total memory available on the machine affects the probability of the system allocator finding unused memory adjacent to the string? I'm running with about 8 gigabytes of free memory... – juanpa.arrivillaga Jun 11 '17 at 19:18
  • @juanpa.arrivillaga I doubt it's the amount of free memory, since I have 20 GB free out of 32 GB total. Which OS do you use? And how high can you go before it hits quadratic? :) – max Jun 11 '17 at 19:26

3 Answers

16

In the end, the platform C allocators (like malloc()) are the ultimate source of memory. When CPython tries to reallocate string space to extend its size, it's really the system C realloc() that determines the details of what happens. If the string is "short" to begin with, chances are decent the system allocator finds unused memory adjacent to it, so extending the size is just a matter of the C library's allocator updating some pointers. But after repeating this some number of times (depending again on details of the platform C allocator), it will run out of space. At that point, realloc() will need to copy the entire string so far to a brand new larger block of free memory. That's the source of quadratic-time behavior.

Note, e.g., that growing a Python list faces the same tradeoffs. However, lists are designed to be grown, so CPython deliberately asks for more memory than is actually needed at the time. The amount of this overallocation scales up as the list grows, enough to make it rare that realloc() needs to copy the whole list-so-far. But the string optimizations do not overallocate, making cases where realloc() needs to copy far more frequent.
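
A quick way to see this difference from Python (a small sketch of my own; it shows the Python-level bookkeeping rather than what malloc() actually reserves, and the exact numbers are CPython- and platform-specific): sys.getsizeof() reports no slack for a str, while a list's reported size grows in jumps because of over-allocation.

import sys

s, lst = '', []
for i in range(20):
    s += 'a'
    lst.append('a')
    # the str's reported size grows by exactly one byte per ASCII character,
    # while the list's reported size stays flat between over-allocation jumps
    print(i + 1, sys.getsizeof(s), sys.getsizeof(lst))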

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • Ahhh it doesn't over-allocate! I somehow assumed it does; but it makes sense: this is just an optimization that's not guaranteed by the language anyway, so it's not as sophisticated as `list.append()`. Out of curiosity though, apart from memory usage, is there any downside to over-allocating in string optimizations? – max Jun 11 '17 at 19:30
  • 2
    Strings are immutable so it's a total waste of space if you overallocate. It's just an "implementation detail" that this (quite uncommon) way of string concatenation _could be_ optimized. – MSeifert Jun 11 '17 at 19:45
  • Don't you still have to copy the memory? Allocation hardly seems like the issue here. – user541686 Jun 12 '17 at 00:19
  • @Mehrdad It depends: if `realloc` sees there's some "free space" behind the current array it can (try to) claim that space and doesn't need to copy. If there's no "free space" after the array it definitely needs to copy the complete array to a memory location where the "bigger array" can fit. – MSeifert Jun 12 '17 at 01:13
  • @MSeifert: But do you have *any* shred of evidence to believe `realloc` is called, let alone in a useful way? Strings are immutable; the interpreter can't just randomly change the length of the first one to append the second one to it. The way to do that is to first prove the refcount of the first one has gone to zero. That would be extremely difficult in a multithreading scenario since the interpreter can stop arbitrarily long in-between the relevant bytecode instructions and might want to use the memory for something else. If you actually think it through I don't think it makes sense at all. – user541686 Jun 12 '17 at 01:32
  • 2
    @Mehrdad [Yes](https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L996) and [yes](https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L932) (at least for CPython and the question is tagged CPython). There are lots of safety checks that make sure it doesn't happen in case _it could go wrong_. But the example in the question is specifically designed to "go into the realloc path". – MSeifert Jun 12 '17 at 01:35
  • @MSeifert: Thanks a ton, that's awesome. I have to say I'm really impressed by CPython there. But it means CPython probably isn't doing what someone [said](https://mail.python.org/pipermail/python-list/2007-September/421966.html) a while ago, i.e. executing 100 instructions at a time and then switching threads -- otherwise it would need to allocate and construct the new string before assigning to the variable containing the old one, right? (`INPLACE_ADD` comes before `STORE_FAST`...) Although I suppose it could peek ahead to see if it can execute 1 extra instruction first? – user541686 Jun 12 '17 at 01:52
  • 1
    @Mehrdad, note that all the references supplied in these answers emphasize that this optimization is specific to CPython, and is very limited. That's key to understanding your objection wrt threads: in CPython, the GIL (Global Interpreter Lock) guarantees only one thread is running CPython internals during the bits of the C implementation that decide whether resizing is safe (and, if so, go on to do the realloc). That's all a single atomic operation so far as threads in CPython are concerned. – Tim Peters Jun 12 '17 at 01:54
  • @TimPeters: But I'm also only talking about CPython in my comments. I thought it executes 100 instructions at a time before switching threads, barring explicit waits (see the comment). – user541686 Jun 12 '17 at 01:55
  • @Mehrdad, I don't see the relevance. CPython's thread-switching algorithm varies across releases. Since 3.2, it uses a time-based approach rather than a PVM opcode-count-based approach (https://docs.python.org/3.6/library/sys.html#sys.setswitchinterval). Regardless, it seems irrelevant to me (the GIL is supremely relevant, but not the algorithm used to decide when to switch threads). – Tim Peters Jun 12 '17 at 02:04
  • TimPeters: It was relevant in terms of constraining what optimizations I thought the CPython interpreter could actually do (as I explained above), and I wasn't aware that it's varied between releases. Otherwise yes, I agree it's not relevant. :) Also, I think you or @MSeifert (or both) should add those source code links to your answers, it's great to prove what Python actually does, not merely what it could potentially do! – user541686 Jun 12 '17 at 02:05
4

[XXXXXXXXXXXXXXXXXX............]
 \________________/\__________/
     used space      reserved
                      space

When growing a contiguous array data structure (illustrated above) through appending to it, linear performance can be achieved if the extra space reserved while reallocating the array is proportional to the current size of the array. Obviously, for large strings this strategy is not followed, most probably to avoid wasting too much memory. Instead, a fixed amount of extra space is reserved during each reallocation, resulting in quadratic time complexity. To understand where the quadratic performance comes from in the latter case, imagine that no over-allocation is performed at all (which is the boundary case of that strategy). Then at each iteration a reallocation (requiring linear time) must be performed, and the full runtime is quadratic.
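
To make the arithmetic concrete, here is a toy model of my own (it only counts copies in pure Python and is not CPython's or the C library's actual allocator): appending n items with proportional over-allocation copies O(n) elements in total, while a fixed pad copies O(n**2):

def elements_copied(n, grow):
    # Count how many elements get copied while appending n items one at
    # a time; grow(capacity) gives the new capacity once the array is full.
    capacity = size = copied = 0
    for _ in range(n):
        if size == capacity:
            copied += size        # a reallocation that could not extend in place
            capacity = grow(size)
        size += 1
    return copied

n = 10 ** 6
print(elements_copied(n, lambda c: 2 * c + 1))   # proportional growth: O(n) copies total
print(elements_copied(n, lambda c: c + 4096))    # fixed pad: O(n**2) copies total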

Leon
  • 31,443
  • 4
  • 72
  • 97
  • `most probably with the purpose of not wasting too much memory`: I guess overallocating is only good if there's a high probability the growth will continue in the future (to justify the use of extra memory). And while the growth will very often continue after `list.append()`, it was probably deemed to be unlikely that string concatenation will repeat many times, this being an anti-pattern? – max Jun 11 '17 at 19:28
  • 1
    @max, CPython never _intended_ to resize strings in-place. That optimization was added more than a dozen years after Python was first released. Some core developers were opposed to it even then. Note that the _ability_ to overallocate costs memory even if it's never used(!): then the object needs to store not only the current size, but also the amount currently allocated. Blowing an extra 8 bytes to store the character `"k"` just in case someone wants to extend it later is not attractive ;-) Multiply that by millions (programs slinging millions of strings are pretty common). – Tim Peters Jun 12 '17 at 03:29
4

TL;DR: Just because string concatenation is optimized under certain circumstances doesn't mean it's necessarily O(1); it's just not always O(n). What determines the performance is ultimately your system, and it could be smart (beware!). Lists that "guarantee" amortized O(1) append operations are still much faster and better at avoiding reallocations.


This is an extremely complicated problem, because it's hard to "measure quantitatively". If you read the announcement:

String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.

If you take a closer look at it, you'll note that it mentions "certain circumstances". The tricky thing is to find out what these certain circumstances are. One is immediately obvious:

  • If nothing else holds a reference to the original string.

Otherwise it wouldn't be safe to change s.

But another condition is:

  • If the system can do the reallocation in O(1) - that means without needing to copy the contents of the string to a new location.

That's where it gets tricky, because the system is responsible for doing the reallocation, and that's nothing you can control from within Python. However, your system is smart: in many cases it can actually do the reallocation without needing to copy the contents. You might want to take a look at @TimPeters' answer, which explains some of this in more detail.

I'll approach this problem from an experimentalist's point of view.

You can easily check how many reallocations actually need a copy by checking how often the ID changes (because the id function in CPython returns the memory address):

changes = []
s = ''
changes.append((0, id(s)))
for i in range(10000):
    s += 'a'
    if id(s) != changes[-1][1]:
        changes.append((len(s), id(s)))

print(len(changes))

This gives a different number on (almost) each run. On my computer it's somewhere around 500, and even for range(10000000) it's just 5000.

But if you think that's really good at "avoiding" copies, you're wrong. Compare it to the number of resizes a list needs (lists over-allocate intentionally, so append is amortized O(1)):

import sys

changes = []
s = []
changes.append((0, sys.getsizeof(s)))
for i in range(10000000):
    s.append(1)
    if sys.getsizeof(s) != changes[-1][1]:
        changes.append((len(s), sys.getsizeof(s)))

print(len(changes))

That only needs 105 reallocations (always).


I mentioned that realloc could be smart, and I intentionally kept the sizes at which the reallocs happened in a list. Many C allocators try to avoid memory fragmentation, and at least on my computer the allocator does something different depending on the current size:

# changes is the one from the 10 million character run

# the %matplotlib magic requires IPython / Jupyter
%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)

#ax.plot(sizes, num_changes, label='str')
ax.scatter(np.arange(len(changes)-1), 
           np.diff([i[0] for i in changes]),   # plotting the difference!
           s=5, c='red',
           label='measured')
ax.plot(np.arange(len(changes)-1), 
        [8]*(len(changes)-1),
        ls='dashed', c='black',
        label='8 bytes')
ax.plot(np.arange(len(changes)-1), 
        [4096]*(len(changes)-1),
        ls='dotted', c='black',
        label='4096 bytes')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('x-th copy')
ax.set_ylabel('characters added before a copy is needed')
ax.legend()
plt.tight_layout()

[log-log scatter plot: characters added before a copy is needed, vs. x-th copy; dashed reference line at 8 bytes, dotted reference line at 4096 bytes]

Note that the x-axis represents the number of "copies done" not the size of the string!

That graph was actually very interesting for me, because it shows clear patterns: for small arrays (up to 465 elements) the steps are constant, with a reallocation needed for every 8 elements added. Then it needs to actually allocate a new array for every character added, and at roughly 940 all bets are off until (roughly) one million elements. Then it seems to allocate in blocks of 4096 bytes.

My guess is that the C allocator uses different allocation schemes for differently sized objects. Small objects are allocated in blocks of 8 bytes; for bigger-than-small-but-still-small arrays it stops over-allocating; for medium-sized arrays it probably positions them where they "may fit"; and for huge (comparatively speaking) arrays it allocates in blocks of 4096 bytes.

I guess the 8 bytes and 4096 bytes aren't random. 8 bytes is the size of an int64 (or float64, aka double), and I'm on a 64-bit computer with Python compiled for 64 bits. And 4096 is the page size of my computer. I assume there are lots of "objects" that need these sizes, so it makes sense that the allocator uses them, because it could avoid memory fragmentation.

You probably know this, but just to make sure: for O(1) (amortized) append behaviour, the over-allocation must depend on the size. If the over-allocation is constant, it will be O(n**2) (the greater the over-allocation, the smaller the constant factor, but it's still quadratic).

So on my computer the runtime behaviour will always be O(n**2), except for lengths of (roughly) 1,000 to 1,000,000, where it really seems to be undefined. In my test run it was able to add many (tens of) thousands of elements without ever needing a copy, so it would probably "look like O(1)" when timed.

Note that this is just my system. It could look totally different on another computer, or even with another compiler on my computer. Don't take these numbers too seriously. I provided the code to do the plots yourself, so you can analyze your system yourself.


You also asked (in the comments) whether there would be downsides to over-allocating strings. That's really easy: strings are immutable, so any over-allocated byte wastes resources. There are only a few limited cases where a string really does grow in place, and those are considered implementation details. The developers probably don't want to throw away space to make an implementation detail perform better; some Python developers also think that adding this optimization was a bad idea.

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • Alex's speech is amazing :) As for the number of reallocations, I still think it's really impressively small. 105 vs 5000 is not a big deal because it still manages to avoid materially impacting the time to incrementally build up a string character by character: it's still linear rather than quadratic as it "should" be. But either way, I agree it's an absolutely horrible practice to rely on it, so I was interested merely for education purposes. Re: intelligent realloc that adjusts to memory request patterns: cool! Though I suppose it's not done yet in any mainstream OS (win/linux)? – max Jun 11 '17 at 22:06
  • 1
    @MSeifert, just to drive home how system-dependent this is, on my Win10 box just now under Python 3.6.1, your string loop reallocated about 180 times for 10 thousand iterations, and about 2400 times for 10 million iterations (versus about 50 and 105 times for the list loop). – Tim Peters Jun 11 '17 at 22:08
  • @max I updated the post. After some looking around "smart memory allocation based on past allocation history" is more a theoretical concept than actual implementation. :( I also included a graph that shows why it's quadratic for strings greater than 1 million elements on my computer (and it's pretty to look at I think xD). – MSeifert Jun 12 '17 at 01:01
  • @MSeifert, FYI, if you're running a current version of Python 3, "almost all" memory allocations are given to CPython's own "small block allocator" (obmalloc.c) first. If the request doesn't exceed 512 bytes, obmalloc handles it itself. And obmalloc then rounds the size up to the nearest multiple of 8 bytes. That's why you see "jump at every 8" so long as the string remains short. Above 512 bytes, you're seeing what your platform C library does (malloc() and realloc()). – Tim Peters Jun 12 '17 at 03:04
  • Awesome answer! Years later, on Python 3.9 on Windows 10, I get pretty much the same patterns as in your plot, except 16 bytes instead of 8. But [on Colab](https://colab.research.google.com/drive/1Yx8CDBMGSdP271uP8-M5Bjd52fSdjkNN?usp=sharing) it looks totally different: Overallocate by 8 bytes until length 464, then by about 25% all the way up to length ten million. – Stefan Pochmann Oct 17 '20 at 14:02