
I was benchmarking some code for a project with timeit (using a free replit, so 1024MB of memory):

code = '{"type":"body","layers":['

for x, row in enumerate(pixels):
    for y, pixel in enumerate(row):
        if pixel != (0, 0, 0, 0):
            code += f'''{{"offsetX":{-start + x * gap},"offsetY":{start - y * gap},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{'#%02x%02x%02x' % (pixel[:3])}","hideBorder":1}},'''
    
code += '],"sides":1,"name":"Image"}}'

The loop runs for every single pixel inside a given image (not efficient of course, but I haven't implemented anything to reduce loop times yet), so any optimization I can get in the loop is worth it.

I remembered that f-strings are faster than string concatenation as long as you're combining three or more strings, and as shown, I'm combining far more than three, so I decided to replace the += inside the loop with an f-string and see the improvement.

code = '{"type":"body","layers":['

for x, row in enumerate(pixels):
    for y, pixel in enumerate(row):
        if pixel != (0, 0, 0, 0):
            code = f'''{code}{{"offsetX":{-start + x * gap},"offsetY":{start - y * gap},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{'#%02x%02x%02x' % (pixel[:3])}","hideBorder":1}},'''
    
code += '],"sides":1,"name":"Image"}}

The results of 500 timeit iterations:

+= took 5.399778672000139 seconds
fstr took 6.91279206800027 seconds

I've rerun this multiple times; the above times are the best f-strings have done so far. Why are f-strings slower in this case?
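For reference, the isolated comparison I was remembering looks roughly like this (toy strings, not my project code):

```python
from timeit import timeit

# Hypothetical short strings, just to isolate the "3+ strings" claim
a, b, c, d = "offsetX", "offsetY", "rot", "team"

def by_concat():
    return a + " " + b + " " + c + " " + d

def by_fstring():
    return f"{a} {b} {c} {d}"

# Both build the same string
assert by_concat() == by_fstring()

print("concat  ", timeit(by_concat))
print("f-string", timeit(by_fstring))
```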

PS: This is my first time posting a question here. Any suggestions on how to improve my future questions would be greatly appreciated :D

  • Your code has syntax errors, specifically at `f\'` and the last line - can you update your question so that the code is the actual code that's giving you the problems? – Grismar Jul 20 '22 at 00:21
  • Moreover, please also isolate out the image-related code; for benchmarking, it is best to reduce the code so that aspects unrelated to the feature being benchmarked are kept to a minimum. – metatoaster Jul 20 '22 at 00:22
  • Also, there's already a thread on [string concatenation vs f-string](https://stackoverflow.com/questions/59180574/string-concatenation-with-vs-f-string) here that has the barest minimal code, though it can be argued that particular benchmark is less complicated due to the lack of loops, while this example you have here involves a for loop. – metatoaster Jul 20 '22 at 00:24
  • @Grismar Ah forgot to remove the escape from the timeit. Will do – ChromaticPixels Jul 20 '22 at 00:28
  • @metatoaster although that is true, OP is apparently aware of the outcomes of that thread and is asking specifically why in this case f-strings would work out to be slower, considering more substitutions are involved. – Grismar Jul 20 '22 at 00:28
  • @metatoaster Grismar is correct about your 2nd comment, but I can gladly edit out the previous code :) – ChromaticPixels Jul 20 '22 at 00:31
  • I did also note that you have a for-loop here, and I should add that your f-string example references the variable being assigned in order to build a longer string, which strengthens my agreement that this question is unique. – metatoaster Jul 20 '22 at 00:32
  • @jasonharper: Note: To be clear, *C*Python (the reference interpreter) special cases that optimization. Non-reference-counted Python interpreters can't apply that optimization (at least, not outside *very* restrictive circumstances, and I don't think any do). [It's super-brittle](https://stackoverflow.com/a/72568770/364696) even there, and PEP8 specifically says not to rely on it. – ShadowRanger Jul 20 '22 at 00:37
  • I'm surprised they take almost the same time. How many nonzero pixels does your image have? – Kelly Bundy Jul 20 '22 at 01:30

2 Answers


So, first off, repeated concatenation in a language with immutable strings is, theoretically, O(n²), while efficiently implemented bulk concatenation is O(n), so both versions of your code are theoretically bad for repeated concatenation. The version that works everywhere with O(n) work is:

code = ['{"type":"body","layers":[']  # Use list of str, not str

for x, row in enumerate(pixels):
    for y, pixel in enumerate(row):
        if pixel != (0, 0, 0, 0):
            code.append(f'''{{"offsetX":{-start + x * gap},"offsetY":{start - y * gap},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{'#%02x%02x%02x' % (pixel[:3])}","hideBorder":1}},''')  # Append each new string to list
    
code.append('],"sides":1,"name":"Image"}}')
code = ''.join(code)  # Efficiently join list of str back to single str

Your code with += happens to work efficiently enough because of a CPython-specific optimization for string concatenation when concatenating to a string with no other living references, but the very first Programming Recommendation in the PEP 8 style guide specifically warns against relying on it:

... do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Essentially, your original +=-based code benefited from the optimization and, as a result, performed fewer data copies. Your f-string-based code did the same work, but in a way that prevented the CPython optimization from applying (building a brand new, increasingly large str every time). Both approaches are poor form; one of them is just slightly less awful on CPython. When your hot code is performing repeated concatenation, you're already doing the wrong thing: just use a list of str and ''.join at the end.
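To see all three strategies side by side, here's a stripped-down sketch (synthetic pieces, not your pixel data; exact timings will vary by machine and interpreter):

```python
from timeit import timeit

# Synthetic stand-ins for the per-pixel JSON fragments
pieces = [f"item{i}," for i in range(2000)]

def concat_plus():
    # Repeated +=: benefits from CPython's fragile in-place realloc trick
    s = ""
    for p in pieces:
        s += p
    return s

def concat_fstr():
    # Rebinding via an f-string builds a brand-new string every pass,
    # so the in-place optimization can't apply
    s = ""
    for p in pieces:
        s = f"{s}{p}"
    return s

def concat_join():
    # Portable O(n): accumulate pieces in a list, join once
    return "".join(pieces)

# All three build the identical string
assert concat_plus() == concat_fstr() == concat_join()

print("+=    ", timeit(concat_plus, number=20))
print("f-str ", timeit(concat_fstr, number=20))
print("join  ", timeit(concat_join, number=20))
```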

ShadowRanger
  • I used to have a list comprehension, but from my current testing, the list comprehension hits the memory limit first as the images get bigger (replit hard memory limit thing), and although I care about speed, I care about memory more due to the limit. I probably should've specified that list comprehensions aren't something I'm looking for, but seeing as this is likely the answer to my question, I'll mark it as correct. If using a list instead of a list comprehension is more memory efficient, please do tell me. – ChromaticPixels Jul 20 '22 at 00:50
  • From testing I've done way back when, you are indeed correct that lists are *faster* than strings for this. It's just memory. – ChromaticPixels Jul 20 '22 at 00:53
  • @ChromaticPixels: `list` or listcomp would be roughly equivalent, memory-wise. You might minimize overhead of paying `str` object overhead while still getting most of the benefit by occasionally condensing the `list` down midway, e.g. `if len(code) > 1000: code = [''.join(code)]` to condense it down every 1000 elements (tweak as needed). You might also try building it by using `io.StringIO` and `write`ing each string to the object then using `.getvalue()` to retrieve the full string at the end. I can't guarantee anything, but it looks like it's intended to handle this acceptably. – ShadowRanger Jul 20 '22 at 00:58
  • @KellyBundy: The optimization has the greatest effect when the strings are on the smaller end (more `realloc`s can occur in place before it has to move it), and it's masked when you do more work to produce each string (this code isn't particularly high, but it's something). The optimization doesn't completely undo the costs of Schlemiel the Painter's algorithms (it can't, it never actually overallocates by a multiple of the current size, so it can't get true amortized `O(1)` per-concat costs), just gets you something better than `O(n)` in many common cases. The filter might trim a lot too. – ShadowRanger Jul 20 '22 at 01:57
  • @ChromaticPixels You already have two dimensions, could do nested list comps (join each row's strings and join the row-strings). – Kelly Bundy Jul 20 '22 at 02:03
  • @ShadowRanger I rather think it's the opposite, the effect is greatest for *larger* strings. I deleted my comment because I had overlooked their 500 iterations, so their time wasn't 5 seconds but 0.01 seconds, which means much fewer additions than I had originally thought. I believe it *can* get overallocations by a constant factor and be true O(1). As far as I know, the C/OS(?) allocator decides that, and see the repeated [25% increases](https://stackoverflow.com/questions/44487537/why-does-naive-string-concatenation-become-quadratic-above-a-certain-length#comment113882185_44488420) on colab. – Kelly Bundy Jul 20 '22 at 02:18
  • @KellyBundy: To be clear, the case I'm assuming is most efficient is large (at least a few KB) string on the left, small (a handful of bytes) strings being concatenated repeatedly to it (so when the round-off error leaves a 4KB page mostly free, the next 4096/len small string concats are free). I've never seen a `realloc` that does amortized growth, and it seems like a terrible idea (most libraries that wrap it assume they get nothing but some allocator round-off error, which would be maybe a full memory page in most cases, so having `realloc` overallocate itself means pointless memory waste). – ShadowRanger Jul 20 '22 at 05:44
  • @ShadowRanger Do you think that that experiment on colab doesn't show that realloc overallocates by 25% there? Or do you agree that it does? I also tried doing that but adding 1000 chars at a time so I can go further. Until 200~400 MB (differs from run to run), the string object consistently gets a new address after it has become 25% longer than it had been at the previous address change. After that, up to 2 GB (I didn't try further) it's a random(?) mix of mostly 25% increases and 1~8 kB increases. – Kelly Bundy Jul 20 '22 at 13:44
  • @KellyBundy: It's wholly possible colab is doing that. It's a very weird behavior that would wastefully interfere with manual amortized growth (e.g. if your homegrown vector-like thing grows by 50% when it needs to expand, it never benefits from the 25% bump, and it consumes, at least in virtual address space, that extra space for nothing), but I'm guessing in the research world they're supporting, being a bit wasteful on memory use to make up for poorly written code might be considered worth it. *shrugs* – ShadowRanger Jul 20 '22 at 16:30

The answer is in the comments and the links provided.

You'll find that this implementation (of your original example) performs far better:

img = Image.open(r'F:\Project\Python\sandbox_310\test.png')
pixels = list(img.getdata())
width, height = img.size

pixels = tuple(pixels[i * width:(i + 1) * width] for i in range(height))

start = 6
gap, size = (start * 2) / (width - 1), 0.1475 * (64 / width) * (start / 6)

data = [(-start + x * gap, start - y * gap, '#%02x%02x%02x' % (pixel[:3]))
        for x, row in enumerate(pixels) for y, pixel in enumerate(row)]

template = f'''{{{{"offsetX":{{}},"offsetY":{{}},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{{}}","hideBorder":1}}}},'''
code = '{"type":"body","layers":[' + ''.join([template.format(*t) for t in data]) + '],"sides":1,"name":"Image"}}'
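A note on the brace soup: inside an f-string, `{{` and `}}` escape to literal braces, so `{{{{` survives as a literal `{{`, which the later `.format()` call in turn consumes as an escape for a single `{`. A toy version of the template (hypothetical field values, not the real one):

```python
size = 0.5

# In the f-string pass: {{{{ -> {{, {size} is substituted now,
# and {{}} -> {} is left as a placeholder for .format()
template = f'{{{{"size":{size},"team":"{{}}"}}}},'
assert template == '{{"size":0.5,"team":"{}"}},'

# In the .format() pass: {{ -> {, {} -> the argument, }} -> }
print(template.format("#ff0000"))  # -> {"size":0.5,"team":"#ff0000"},
```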

Edit: @KellyBundy asked how much faster:

from PIL import Image
from timeit import timeit

img = Image.open(r'F:\Project\Python\sandbox_310\test.png')
pixels = list(img.getdata())
width, height = img.size

pixels = tuple(pixels[i * width:(i + 1) * width] for i in range(height))
start = 6
gap, size = (start * 2) / (width - 1), 0.1475 * (64 / width) * (start / 6)


def f_sol():
    data = [(-start + x * gap, start - y * gap, '#%02x%02x%02x' % (pixel[:3]))
            for x, row in enumerate(pixels) for y, pixel in enumerate(row)]

    template = f'''{{{{"offsetX":{{}},"offsetY":{{}},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{{}}","hideBorder":1}}}},'''
    code = '{"type":"body","layers":[' + ''.join([template.format(*t) for t in data]) + '],"sides":1,"name":"Image"}}'
    return code


def f_op():
    code = '{"type":"body","layers":['

    for x, row in enumerate(pixels):
        for y, pixel in enumerate(row):
            if pixel != (0, 0, 0, 0):
                code += f'''{{"offsetX":{-start + x * gap},"offsetY":{start - y * gap},"rot":45,"size":{size},"sides":4,"outerSides":0,"outerSize":0,"team":"{'#%02x%02x%02x' % (pixel[:3])}","hideBorder":1}},'''

    code += '],"sides":1,"name":"Image"}}'
    return code


assert f_sol() == f_op()
print(timeit(f_sol, number=10))
print(timeit(f_op, number=10))

Output:

1.7875813000027847
47.82409440000265

So, more than 25x faster, which is why I didn't time them to begin with.

Grismar
  • How far better does it perform? What times do you get with their way and yours? – Kelly Bundy Jul 20 '22 at 01:29
  • I haven't measured it because it was a day and night difference - just dropped it here underneath the already accepted answer for OP to try if they want. – Grismar Jul 20 '22 at 03:53