
Short version: If s is a string, then s = s + 'c' might modify the string in place, while t = s + 'c' can't. But how does the operation s + 'c' know which scenario it's in?

Long version:

t = s + 'c' needs to create a separate string because the program afterwards wants both the old string as s and the new string as t.

s = s + 'c' can modify the string in place if s is the only reference, since the program only wants s to be the extended string. CPython actually performs this optimization when there is room at the end of the string's memory for the extra character.
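The "only reference" part is visible from Python itself: `sys.getrefcount` reports how many references CPython currently sees (plus one for its own argument), which is the kind of bookkeeping that makes the check possible. A small demonstration:

```python
import sys

# Build the string at runtime so it isn't an interned code constant
# (constants carry extra references of their own).
s = ''.join(['c'] * 10)

# sys.getrefcount counts one extra reference: its own argument.
before = sys.getrefcount(s)

t = s                  # alias: now two names refer to the same string
after = sys.getrefcount(s)

print(after - before)  # 1: the alias added exactly one reference
```

Once `del t` runs, the count drops back, and `s` is again the sole reference.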

Consider these functions, which repeatedly add a character:

def fast(n):
    s = ''
    for _ in range(n):
        s = s + 'c'
        t = s
        del t

def slow(n):
    s = ''
    for _ in range(n):
        t = s + 'c'
        s = t
        del t

Benchmark results with n = 100_000 (Try it online!):

fast :   9 ms    9 ms    9 ms    9 ms   10 ms 
slow : 924 ms  927 ms  931 ms  933 ms  945 ms
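For reference, a minimal harness along these lines reproduces those measurements (absolute numbers will differ by machine, and on CPython builds without the optimization the gap disappears):

```python
import timeit

def fast(n):
    s = ''
    for _ in range(n):
        s = s + 'c'
        t = s
        del t
    return s

def slow(n):
    s = ''
    for _ in range(n):
        t = s + 'c'
        s = t
        del t
    return s

for f in (fast, slow):
    # number=1 per run: each run builds the full string from scratch.
    times = timeit.repeat(lambda: f(100_000), number=1, repeat=5)
    print(f.__name__, *(f'{t * 1000:4.0f} ms' for t in sorted(times)))
```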

Note that the extra t = s or s = t makes both variables equivalent references to the string and then del t leaves only s, so at the next loop iteration, s is again the only reference to the string. Thus the only difference between the two functions is the order in which s + 'c' is assigned to s and t.

Let's also disassemble the bytecode. I marked the only three differences with != in the middle. As expected, only the variables for STORE_FAST and LOAD_FAST differ. But up to and including the BINARY_ADD, the bytecode is identical. So how does the BINARY_ADD know whether to optimize or not?

      import dis                                 import dis
      dis.dis(fast)                              dis.dis(slow)
---------------------------------------------------------------------------
    0 LOAD_CONST     1 ('')                    0 LOAD_CONST     1 ('')
    2 STORE_FAST     1 (s)                     2 STORE_FAST     1 (s)
                                                                 
    4 LOAD_GLOBAL    0 (range)                 4 LOAD_GLOBAL    0 (range)
    6 LOAD_FAST      0 (n)                     6 LOAD_FAST      0 (n)
    8 CALL_FUNCTION  1                         8 CALL_FUNCTION  1
   10 GET_ITER                                10 GET_ITER      
>> 12 FOR_ITER      18 (to 32)             >> 12 FOR_ITER      18 (to 32)
   14 STORE_FAST     2 (_)                    14 STORE_FAST     2 (_)
                                                               
   16 LOAD_FAST      1 (s)                    16 LOAD_FAST      1 (s)
   18 LOAD_CONST     2 ('c')                  18 LOAD_CONST     2 ('c')
   20 BINARY_ADD                              20 BINARY_ADD    
   22 STORE_FAST     1 (s)        !=          22 STORE_FAST     3 (t)
                                                               
   24 LOAD_FAST      1 (s)        !=          24 LOAD_FAST      3 (t)
   26 STORE_FAST     3 (t)        !=          26 STORE_FAST     1 (s)
                                                               
   28 DELETE_FAST    3 (t)                    28 DELETE_FAST    3 (t)
   30 JUMP_ABSOLUTE 12                        30 JUMP_ABSOLUTE 12
>> 32 LOAD_CONST     0 (None)              >> 32 LOAD_CONST     0 (None)
   34 RETURN_VALUE                            34 RETURN_VALUE  
no comment
  • My guess is there's some JIT optimization happening *after* the Python bytecode gets assembled. Some low-level system akin to LLVM is looking at the bytecode, deciding that something can be done, and doing it transparently either just after compilation or just before running. – Silvio Mayolo Sep 06 '21 at 19:45
  • Binary_add doesn’t know, the c code it is calling performs the optimization, because it sees the reference count (and thus knows it is safe) – ead Sep 06 '21 at 19:51
  • It's probably worth restricting this to a specific version. There are some very hefty optimizations upcoming (specializations for opcodes to specific built-in types are already in the CPython repo) but they do not apply to the version with which you have timed this - and frankly after a look through the 3.8 source code it seems indeed surprising (and interesting!) that this works. – MisterMiyagi Sep 06 '21 at 19:52
  • @ead But doesn't that C code see the same reference count in both versions? – no comment Sep 06 '21 at 19:55
  • 4
    The magic happens in `string_concatenate()` (in Python/ceval.c), which in turn gets called for BINARY_ADD of two strings. It looks ahead at the next bytecode instruction, and only considers this in-place modification when that instruction is a STORE to the first operand of the addition. – jasonharper Sep 06 '21 at 19:59
  • @MisterMiyagi As far as I know, this optimization was introduced in Python 2.4, and it's still there in Python 3.9.6, so pretty much in all versions currently in use. Do you think I should restrict it for future readers? (If so: how? A note in the question?) – no comment Sep 06 '21 at 20:04
  • @jasonharper Sounds like a good answer (and the question got reopened), if you could link to those code parts as well. I can't find string_concatenate in ceval.c or anywhere else. – no comment Sep 06 '21 at 20:27
  • @uhoh No, it doesn't. This question starts at already knowing the mechanism exists but asking how it works. – MisterMiyagi Jun 13 '22 at 05:19
  • @MisterMiyagi Thanks; since the linked question has a vote to close as duplicate and I don't think it's good to divert readers here and away from those answers, perhaps neither should be closed as duplicate of the other? – uhoh Jun 13 '22 at 05:54
  • 1
    @uhoh Thanks for pointing out the other dupe vote. I've left a comment on the other question as well now. – MisterMiyagi Jun 13 '22 at 07:31
  • @MisterMiyagi thanks! (also, retracted here). – uhoh Jun 13 '22 at 12:09

1 Answer


Here's the code in question, from the Python 3.10 branch (in ceval.c, and called from the same file's implementation of the BINARY_ADD opcode). As @jasonharper noted in a comment, it peeks ahead to see whether the result of the BINARY_ADD will next be bound to the same name from which the left-hand addend came. In fast(), it is (operand came from s and result stored into s), but in slow() it isn't (operand came from s but stored into t).
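That lookahead can be mimicked from Python with `dis` (this helper is my own sketch, not CPython's code): for each add in the bytecode, ask whether the very next instruction stores the result back into the same local the left operand was loaded from. Newer CPythons rename the opcode (`BINARY_OP` from 3.11) and 3.12 fuses adjacent loads/stores into super-instructions, so the sketch expands those first:

```python
import dis

# 3.12-style fused "super-instructions" pack two loads/stores into one;
# expand them so neighboring-instruction indexing stays uniform.
SUPER = {
    "LOAD_FAST_LOAD_FAST": ("LOAD_FAST", "LOAD_FAST"),
    "STORE_FAST_LOAD_FAST": ("STORE_FAST", "LOAD_FAST"),
    "STORE_FAST_STORE_FAST": ("STORE_FAST", "STORE_FAST"),
}

def ops(func):
    """Yield (opname, argval) pairs, with super-instructions expanded."""
    for ins in dis.get_instructions(func):
        if ins.opname in SUPER:
            names = (ins.argval if isinstance(ins.argval, tuple)
                     else tuple(n.strip() for n in ins.argrepr.split(",")))
            yield from zip(SUPER[ins.opname], names)
        else:
            yield ins.opname, ins.argval

def left_operand_reused(func):
    """For each `+` in func's bytecode, report whether the instruction
    right after the add stores the result back into the local the left
    operand was loaded from -- the condition the interpreter's
    lookahead tests before attempting an in-place resize."""
    seq = list(ops(func))
    hits = []
    for i, (name, _) in enumerate(seq):
        if name in ("BINARY_ADD", "BINARY_OP"):  # 3.10 and older / 3.11+
            left_name, left_var = seq[i - 2]
            next_name, next_var = seq[i + 1]
            hits.append(left_name == "LOAD_FAST"
                        and next_name == "STORE_FAST"
                        and next_var == left_var)
    return hits

def fast(n):
    s = ''
    for _ in range(n):
        s = s + 'c'
        t = s
        del t

def slow(n):
    s = ''
    for _ in range(n):
        t = s + 'c'
        s = t
        del t

print(left_operand_reused(fast), left_operand_reused(slow))
# [True] [False]: fast stores the sum back into s, slow stores it into t
```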

There's no guarantee this optimization will persist, though. For example, I noticed that your fast() is no faster than your slow() on the current development CPython main branch (which is the current work-in-progress toward an eventual 3.11 release).

Should people rely on this?

As noted, there's no guarantee this optimization will persist. "Serious" Python programmers should know better than to rely on dodgy CPython-specific tricks, and, indeed, PEP 8 explicitly warns against relying on this specific one:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b ...
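The portable way to build a string incrementally, linear on every Python implementation, is to collect the pieces and join them once; an in-memory `io.StringIO` buffer is an equivalent alternative:

```python
import io

def build_join(n):
    # Collect pieces in a list, then join once: O(n) everywhere.
    parts = []
    for _ in range(n):
        parts.append('c')
    return ''.join(parts)

def build_stringio(n):
    # Same result via an in-memory text buffer.
    buf = io.StringIO()
    for _ in range(n):
        buf.write('c')
    return buf.getvalue()

print(build_join(5) == build_stringio(5) == 'ccccc')  # True
```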

Tim Peters
  • Thanks, looking into the details now. In `main`, did `fast` become slow or did `slow` become fast? – no comment Sep 06 '21 at 20:58
  • `fast()` became slow - the special-case code checking for Unicode operands no longer exists in `main`'s `BINARY_ADD` implementation. Nothing comes "for free": that check marginally slowed _every_ add, including things like `1 + 2`. 3.11 will start taking stabs at rewriting opcodes on the fly, to adapt to types discovered at runtime. Work in progress. – Tim Peters Sep 06 '21 at 21:03
  • Yeah, I've been wondering whether it slows down other things. So, does that mean tons of old code building strings this way will suddenly become slow, or is that work in progress likely going to reintroduce the optimization? – no comment Sep 06 '21 at 21:10
  • Don't know. See edit just now for more info. – Tim Peters Sep 06 '21 at 21:20