102

Strings in Python are immutable, which means the value cannot be changed. However, when appending to the string in the following example, it looks like the original string memory is modified since the id remains the same:

>>> s = 'String'
>>> for i in range(5, 0, -1):
...     s += str(i)
...     print(f"{s:<11} stored at {id(s)}")
... 
String5     stored at 139841228476848
String54    stored at 139841228476848
String543   stored at 139841228476848
String5432  stored at 139841228476848
String54321 stored at 139841228476848

Conversely, in the following example, the id changes:

>>> a = "hello"
>>> id(a)
139841228475760
>>> a = "b" + a[1:]
>>> print(a)
bello
>>> id(a)
139841228475312
Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
Master_Roshy
  • 1,245
  • 3
  • 12
  • 19
  • Who said that `str` is `immutable`? – Mohamad Ghaith Alzin Jun 10 '22 at 03:10
  • 8
    @MohamadGhaithAlzin: The [docs](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str), for one: "Strings are immutable sequences of Unicode code points." – user2357112 Jun 10 '22 at 03:11
  • 1
    The optimization you're seeing here breaks string immutability. The devs either didn't think about it or decided the minor violation of immutability was worth it to stop people from constantly complaining about the performance of this kind of loop. – user2357112 Jun 10 '22 at 03:13
  • 3
    `The standard wisdom is that Python strings are immutable. You can't change a string's value, only the reference to the string.` [continue reading here](https://austinhenley.com/blog/pythonstringsaremutable.html#:~:text=The%20standard%20wisdom%20is%20that,the%20reference%20to%20the%20string.) – Mohamad Ghaith Alzin Jun 10 '22 at 03:14
  • 9
    @user2357112: They thought about it, and they only do it when the mutability is undetectable by anything other than silly `id` checks of this sort. If you add `s2 = s` after the `s += str(i)` line, you'll see the `id` change all the time, because now that the `str` is visible through multiple aliases, they can't use the optimization. – ShadowRanger Jun 10 '22 at 03:18
  • A string variable is actually an object. the string value of the object is stored in a different memory address. What you have printed is the id of the object, not the string itself. The string is actually being copied. – chouyangv3 Jun 10 '22 at 03:24
  • 1
    @chouyangv3: You're mistaken. CPython stores the core data of the string in what amounts to a flexible array member at the end of the struct (it can also store other copies of the data in separate arrays, but the canonical representation is always allocated inline, in the same allocation as the struct itself); if the string is actually copied to a new object, the `id` changes. The optimization in CPython sometimes lets it avoid that copy by `realloc`ing in place when it can, if the mutation is not otherwise detectable. – ShadowRanger Jun 10 '22 at 03:27
  • @ShadowRanger So it acts like golang slices? I'm not familiar with python, sorry. – chouyangv3 Jun 10 '22 at 03:50
  • 2
    @chouyangv3: You need to understand C to know what the CPython reference interpreter is doing here, specifically [flexible array members](https://en.wikipedia.org/wiki/Flexible_array_member) (which were standardized in C99, but you could simulate them in any version of C either by putting a length 1 array at the end of a struct and choosing to allocate more than just `sizeof(thestruct)`, or just by allocating extra and casting a pointer to the byte after the struct to the correct type; old `str` did the former, new `str` [with variable width characters] does the latter). – ShadowRanger Jun 10 '22 at 03:58
  • @chouyangv3: You can see the description of the struct layout [in the internal header defining the multiple structs that can implement `str` in modern Python](https://github.com/python/cpython/blob/b2694ab46997746eae7e913738623b39f6334473/Include/cpython/unicodeobject.h#L46). – ShadowRanger Jun 10 '22 at 03:59
  • 18
    @user2357112 Why do you say it breaks immutability? All we see is that the object afterwards has the same address that the object beforehand had. That doesn't mean that they're the same object. – Kelly Bundy Jun 10 '22 at 03:59
  • 3
    @KellyBundy: It actually does mean that. An object's `id` is temporally unique (while an object possesses a specific `id`, no other object can have the same `id`). `s += str(i)` for immutable types is defined to produce the concatenated value of `s` and `str(i)` *first*, then replace `s` *after*. Since `s` still exists when the concatenation occurs, the concatenated value would not be allowed to have the same `id` (because `s` already has it). CPython is cheating when it does this, but it's a harmless sort of cheat (improves performance, only observable effect otherwise is the `id` quirk). – ShadowRanger Jun 10 '22 at 04:03
  • 1
    @ShadowRanger How do you know Python doesn't create a new object, then deletes the old one, then moves the new one to the old address? – Kelly Bundy Jun 10 '22 at 04:05
  • 2
    @KellyBundy: You can hypothesize all sorts of nuts things, but when there is zero reason to do it (it would be the *opposite* of the optimization CPython actually does, doubling the work instead of reducing it), you can rule insanity like that out. Plus, I've read the source code, for multiple releases of CPython; it's definitely doing that (the exact mechanics have changed over time, but it's all fundamentally doing the same optimization). :-) – ShadowRanger Jun 10 '22 at 04:15
  • 7
    @ShadowRanger Yes, I also know it's doing that. But you're rather proving my point: You know because you've read the implementation. In my view, immutability is a *Python* concept, and it doesn't matter what CPython does internally. If you need to refer to the CPython implementation to argue that it's still the same object, but Python doesn't say it's the same object, that doesn't count. – Kelly Bundy Jun 10 '22 at 06:26
  • 6
    @KellyBundy: See my comments on Mark's answer below; I went back to the data model to verify. The language spec itself imposes an operation ordering guarantee and an `id` consistency guarantee, that, in combination, are definitely broken by this optimization (in the sense that it proves a violation of `str` immutability). Your "copy it back to the old `str`s memory" suggestion (which is itself based on a CPython optimization detail) isn't allowed, because by the language spec, both objects must briefly coexist, and the `id` of both must be constant throughout their lifetime. – ShadowRanger Jun 10 '22 at 10:59
  • @ShadowRanger Looks convincing, yes. I don't think `x = x.__add__(y)` is allowed to drop the old reference before calling the function. Fun fact: if you do use `s = s.__add__(t)` instead of `s += t`, you get different ids, so they're not really equivalent. – Kelly Bundy Jun 10 '22 at 11:30
  • 1
    I guess [this](https://stackoverflow.com/q/61636151/12671057) proves the value of keeping it short... (Same question but zero votes, maybe due to its off-putting length.) – Kelly Bundy Jun 10 '22 at 15:35
  • Does this answer your question? [How is the s=s+c string concat optimization decided?](https://stackoverflow.com/questions/69079181/how-is-the-s-sc-string-concat-optimization-decided) – S.B Jun 12 '22 at 16:11
  • @S.B just because something is related and older does not mean this should be closed as duplicate and readers should be *diverted away from these excellent answers* and to that answer instead. Duplicates can point forward in time as well, so if these are really duplicates then perhaps that should be closed as duplicate of this instead? – uhoh Jun 12 '22 at 21:17
  • @uhoh Actually it is not just "related" , it is exactly the same question. If this is not the case where we should use the duplicate flag, then when is the proper time? These answers are of course fantastic I upvoted too, but if the question is closed as duplicate, still users can see these answers. They won't be deleted. I saw a lot of great questions/answers which received more than thousands of upvotes but closed as duplicate. – S.B Jun 12 '22 at 21:38
  • @S.B so close that one not this one because these answers are much better. Nobody knows how answer visibility is impacted when questions are closed. [Have there been analyses to see if views of well-received answers are reduced by closure as duplicate which try to control for other factors?](https://meta.stackexchange.com/q/366884/303080) But older does not equal better. – uhoh Jun 12 '22 at 21:39
  • @uhoh https://meta.stackoverflow.com/a/278963/13944524 – S.B Jun 12 '22 at 22:03
  • @S.B so instead of oldest answers, lets steer future readers to the *best answers*; after all, good answers to on-topic questions is what it's all about, right? (you could also propose merging them if you think the old answers help here) – uhoh Jun 12 '22 at 22:19
  • @uhoh The linked question is not really old(9 months ago) and has more details in the question itself. It shows the performance comparison of this behavior as well so that makes the question even more exciting... Both questions received great answers from two knowledgeable guys I know here(That one is from Tim Peters). With all due respect, I have one vote for myself and to me duplicate is duplicate. It still requires two more votes if others think like me. – S.B Jun 12 '22 at 22:37
  • @uhoh This is the first time I hear about merging. Could you tell me how can I request for that? by flagging for the moderator? – S.B Jun 12 '22 at 22:54
  • 1
    @S.B [What is a "merged" question?](https://meta.stackexchange.com/q/158066/303080) (surprisingly while it links back to [FAQ](https://meta.stackexchange.com/q/7931/303080) it doesn't appear there) It requires some moderator effort and decisions so they usually don't like to do it a lot. In SO: https://meta.stackoverflow.com/search?q=merging+questions – uhoh Jun 12 '22 at 23:05
  • 2
    These two questions do *not* seem to be duplicates. This question here is about whether optimisation happens at all and its semantics – re-using `id`s is *not* in itself an indication that an immutable type is mutated, and commonly happens because the memory is simply re-used. The other question already starts at taking the optimisation for granted and asks how CPython specifically implements it. Neither does this question's *answers* solve the other question, nor do the other question's *answers* solve this question. They are interesting trivia wrt each other but not more. – MisterMiyagi Jun 13 '22 at 07:31

4 Answers4

124

It's a CPython-specific optimization for the case when the str being appended to happens to have no other living references. The interpreter "cheats" in this case, allowing it to modify the existing string by reallocating (which can be in place, depending on heap layout) and appending the data directly, and often reducing the work significantly in loops that repeatedly concatenate (making it behave more like the amortized O(1) appends of a list rather than O(n) copy operations each time). It has no visible effect besides the unchanged id, so it's legal to do this (no one with an existing reference to a str ever sees it change unless the str was logically being replaced).

You're not actually supposed to rely on it (non-reference counted interpreters can't use this trick, since they can't know if the str has other references), per PEP8's very first programming recommendation:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

If you want to break the optimization, there are all sorts of ways to do so, e.g. changing your code to:

>>> while i!=0:
...     s_alias = s  # Gonna save off an alias here
...     s += str(i)
...     print(s + " stored at " + str(id(s)))
...     i -= 1
... 

breaks it by creating an alias, increasing the reference count and telling Python that the change would be visible somewhere other than s, so it can't apply it. Similarly, code like:

s = s + a + b

can't use it, because s + a occurs first, and produces a temporary that b must then be added to, rather than immediately replacing s, and the optimization is too brittle to try to handle that. Almost identical code like:

s += a + b

or:

s = s + (a + b)

restores the optimization by ensuring the final concatenation is always one where s is the left operand and the result is used to immediately replace s.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
  • [Subtle way to avoid the optimization](https://tio.run/##K6gsycjPM7YoKPr/v1jBVkE9uKQoMy9dnSsTyDHlKs/IzElVyFRQtFUwsOJSAIJiBW1bheKSIo1MTTC/AKi8RKNYByKWolGsqQmRyFTQtVUw/P8fAA) (at least sometimes :-) – Kelly Bundy Jun 10 '22 at 14:13
  • 7
    @KellyBundy: Yep, this is why PEP8 says not to rely on it. It's *brittle*. It's nice when it helps you, but if you have code that *needs* it for adequate performance, your code needs to be rewritten to use `list` + `''.join`, or `io.StringIO` or whatever. (I deleted my old comments, you're welcome to do so) – ShadowRanger Jun 10 '22 at 14:16
  • @KellyBundy why did it avoid optimization? – ufoq Jun 12 '22 at 08:36
  • 1
    @ufoq Note that unlike the question's code, I print `s`, not a new string built from `s`. And the output is buffered. Apparently `print` or the "file" it writes to [keeps a reference](https://tio.run/##K6gsycjPM7YoKPr/PzO3IL@oRKG4spirWMFWQT24pCgzL12dKxPIMeUqz8jMSVXIVFC0VTCw4lIAgmIFbVuF4pIijUxNML8AqLxEA6hdLz21pCg1LTm/FMTXRJHVgehIAYkToy1TQddWwfD/fwA) to `s` then. If I print [with flush=True](https://tio.run/##JYpLCoAwEMX2PcVzZYsKiggizCn0CH46IFo6FfH0tWh2CXFPsOfR9s7HKCDkY/B8bLniJJ26Le8LGBmhHhQSgoIgwWs2n7u0By3l32YtxpRY90ssTf5a/olREZoYXw), I get the optimization again. – Kelly Bundy Jun 12 '22 at 16:38
  • @KellyBundy: It looks like it's [due to how `io.TextIOWrapper.write` is implemented](https://github.com/python/cpython/blob/6fd4c8ec7740523bb81191c013118d9d6959bc9d/Modules/_io/textio.c#L1568); the buffer at that layer is maintained as either the single thing written, or (when there's more than one buffered thing) a `list` of things that have been written. As an optimization, when the encoding of the file is ASCII-compatible and the `str` is ASCII, it takes a reference to the `str` directly rather than encoding it to `bytes` before putting it in the `list`. – ShadowRanger Jun 14 '22 at 14:57
  • Change `s += str(i)` to `s += chr(130 - i)` and [the `write` optimization disappears, enabling the `str +=` optimization, at the moment the first non-ASCII character is appended](https://tio.run/##jYxBCoMwFET3OcV0ZaRVFBGkkFP0CDY1H1oT/v@lePo0dteds3szvEmbhrgOU@Kc6ZUiK2QTI3Cobsq0LpWhAqP5BHp6EE4O3dWgRHB2mAPbfujQgOpfm4qktpy0i1f2jzm@d67/1gtE2dJ9749ohMahz/kL "Python 3.8 (pre-release) – Try It Online"), because `TextIOWrapper.write` is now proactively encoding the non-ASCII `str` to `bytes`, and dropping the reference to the original `str` before it returns. – ShadowRanger Jun 14 '22 at 15:00
44

Regardless of implementation details, the docs say:

… Two objects with non-overlapping lifetimes may have the same id() value.

The previous object referenced by s no longer exists after the += so the new object breaks no rules by having the same id.

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
  • 12
    The lifetimes are defined to overlap though, per [the language data model spec](https://docs.python.org/3/reference/datamodel.html#object.__iadd__), where `+=` is defined as one of `x = x.__iadd__(y)`, `x = x.__add__(y)` or `x = y.__radd__(x)` (and immutable types are defined by not implementing the former). The delayed reassignment to `x` after the work completes is mandatory, officially. In all cases, the new object returned from the method must exist prior to the reassignment of `x`, so the actual spec requires both to exist at once, and therefore the `id`s must differ, at least briefly. – ShadowRanger Jun 10 '22 at 04:27
  • 1
    Hmm... Just remembered "at least briefly" is being too lenient. `id`s are required to stay different, because the `id` of all objects is required to stay constant for its lifetime, so you can't perform shenanigans like giving the `id` of the original `str` to the new `str`. – ShadowRanger Jun 10 '22 at 04:39
  • 10
    @ShadowRanger No user code can run during that brief overlap, so there's no way to detect the shared ID during that period. That's why this optimization is OK. – Barmar Jun 10 '22 at 14:34
  • @Barmar Are you saying there *are* two separate objects with briefly shared ID? – Kelly Bundy Jun 10 '22 at 14:41
  • @Barmar: You may be right (excluding debuggers) in the design as of 3.11 (which limits the optimization to cases followed by `STORE_FAST`). It's weirder through 3.10 though (that is, the current release), as it applied to cases followed by `STORE_DEREF` or `STORE_NAME`, which could, in very specific circumstances involving threads sharing a closure variable, observe the clearing of the target, as the GIL could be taken by another thread between opcodes. and see the pre-cleared variable (they wouldn't see a shared `id`, because the spec was violated and it changed the target early). – ShadowRanger Jun 10 '22 at 14:51
  • 7
    @KellyBundy I'm saying that even if they conceptually overlap, you can't detect it so the optimizer can ignore that requirement. – Barmar Jun 10 '22 at 15:09
  • The term "lifetime" doesn't have a formal definition in the Python docs, so although the comment about `__iadd__` is interesting, it isn't quite right to say that the objects' lifetimes are *defined* to overlap. If the lifetime of the old string object is allowed to end before `x` no longer references that object, which would be perverse but not contradictory, then the lifetimes of the string objects don't overlap and they are allowed to have the same id. That said, I think it's simpler to describe the behaviour by saying that string objects are mutable but can never be observed to mutate. – kaya3 Jun 12 '22 at 18:46
  • 2
    A more clear-cut example of Python strings being mutable is the fact that when you compute the hash of a string, the result is cached so that the interpreter won't have to compute the hash of the same string object again. The cache for a string's hash is a mutable member of the struct for that string, it's initialised as -1 and then overwritten when the hash is computed. So that struct is unquestionably mutable; it just can't be observed to mutate (unless you count `time.perf_counter` as a way of observing it). – kaya3 Jun 12 '22 at 19:00
4

If objects have non-overlapping lifetimes, their id values may be the same, but if variables have overlapping lifetimes so they must have different id values.

Nazhir
  • 87
  • 4
1

In C, Java, or some other programming languages, a variable is an identifier or a name, connected to a memory location.

e.g.

enter image description here

but in Python, a variable is considered a tag that is tied to some value. Python considers value as an object. and it can save memory and assign memory location with respect to value instead of variable.

enter image description here

Variables are the same but if values are different then it can assign new memory, in a closer look no variable can be referenced a=10 then it is removed by the garbage collector and a new value hold the variable in a new memory location.

enter image description here

In C, Java memory location is for variable but in Python, the memory location is for value which can be treated as an object.

Mehmaam
  • 573
  • 7
  • 22