I know that Python strings are immutable, which means that

letters = "world"
letters += "sth"

would give me a different string object after the concatenation

begin: id(letters): 1828275686960
end: id(letters): 1828278265776

However, when I run a for-loop that appends to a string, the string object appears to remain unchanged during the loop:

letters = "helloworld"
print("before for-loop:")
print(id(letters))
print("in for-loop")

for i in range(5):
    letters += str(i)
    print(id(letters))

The output:

before for-loop:
2101555236144
in for-loop
2101557044464
2101557044464
2101557044464
2101557044464
2101557044464

Apparently the underlying string object that letters refers to did not change during the for-loop, which seems to contradict the idea that strings are immutable.

Is this some kind of optimization that Python performs under the hood?

  • Why do you print `letters.__repr__` and not `id(letters)` as above? – mkrieger1 Oct 13 '21 at 13:18
  • `+=` is invoking `__iadd__` method. We must take a look at that method. – MSH Oct 13 '21 at 13:22
  • Note that the output does change at some point if you increase 5 to a larger number. – mkrieger1 Oct 13 '21 at 13:25
  • Note that the output is also the same if you unroll the loop manually. – mkrieger1 Oct 13 '21 at 13:26
  • This seems to be an optimization in CPython. Does this make sense https://web.eecs.utk.edu/~azh/blog/pythonstringsaremutable.html ? – Kris Oct 13 '21 at 13:37
  • @mkrieger1 Indeed – aProgrammingnoob Oct 13 '21 at 13:41
  • @Kris Great resources. – aProgrammingnoob Oct 13 '21 at 13:42
  • Formally speaking, this really shouldn't happen - `a += b` is supposed to be equivalent to `a = operator.iadd(a, b)`, where the new value and old value exist simultaneously before the new value is assigned to the variable. The CPython devs are okay with mutating a string in-place in this case because aside from speed, the observable behavior is *mostly* indistinguishable from the behavior without the optimization... but `id` is the one point where the behavior is in fact distinguishable. – user2357112 Oct 13 '21 at 14:43

1 Answer

From the documentation:

id(object)

Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

CPython implementation detail: This is the address of the object in memory.

In CPython, the value returned by id() is the memory address of the stored string, as the source code shows:

static PyObject *
builtin_id(PyModuleDef *self, PyObject *v)
/*[clinic end generated code: output=0aa640785f697f65 input=5a534136419631f4]*/
{
    PyObject *id = PyLong_FromVoidPtr(v);

    if (id && PySys_Audit("builtins.id", "O", id) < 0) {
        Py_DECREF(id);
        return NULL;
    }

    return id;
} 
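The lifetime rule quoted from the documentation can be checked from Python itself. The following is a minimal sketch; the helper name id_is_stable is made up for illustration. It only asserts what the language guarantees: an object's id stays constant while the object is alive, and two objects that are alive at the same time have different ids.

```python
def id_is_stable(obj):
    """Return True if repeated id() calls agree while obj is alive."""
    first = id(obj)
    return all(id(obj) == first for _ in range(3))


# Build both strings at run time; they have different values,
# so they must be two distinct objects.
a = "".join(["ab", "cd"])
b = "".join(["ab", "ce"])

print(id_is_stable(a))  # True
print(id(a) != id(b))   # True: both objects are alive at the same time
```

Nothing here says what happens to an id after its object dies, which is exactly the gap the in-place append uses.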

What happens is that the lifetime of the old string ends exactly where the lifetime of the new one begins, so the two lifetimes do not overlap and the same id() value may be reused. Python guarantees the immutability of strings only as long as they are alive. As the article suggested by @Kris shows:

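import _ctypes  # private CPython module, so this demo is CPython-only

a = "abcd"
a += "e"

before_f_id = id(a)  # address of the string before the next append

a += "f"

print(a)
# Unsafe: this turns a raw address back into an object. It only works when
# CPython has reused the same memory for the new string.
print(_ctypes.PyObj_FromPtr(before_f_id))  # prints abcdef if the address was reused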

The original string a has ended its life, so it is not guaranteed to be retrievable from its memory location; in fact, the example above shows that the location is reused for the new string.
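The reverse case is easy to verify: if a second reference keeps the original string alive, the old object cannot be recycled, so the result of += must be a different object. A minimal sketch (the function name append_with_alias is only illustrative):

```python
def append_with_alias():
    # Build the string at run time so it is not a shared constant.
    s = "".join(["hello", "world"])
    alias = s        # second reference keeps the original object alive
    old_id = id(s)
    s += "!"         # must produce a new object: alias still needs the old one
    return id(s) != old_id, alias, s


changed, alias, s = append_with_alias()
print(changed)  # True
print(alias)    # helloworld
print(s)        # helloworld!
```

Since alias still observes the unmodified value, immutability holds for every string that is actually reachable.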

We can see how this is implemented under the hood by looking at the last lines of the unicode_concatenate function:

res = v;
PyUnicode_Append(&res, w);
return res;

where v and w are those in the expression: v += w

PyUnicode_Append does in fact try to reuse the same memory location for the new object. In detail:

PyUnicode_Append(PyObject **p_left, PyObject *right):

...

new_len = left_len + right_len;

if (unicode_modifiable(left)
    ...
{
    /* append inplace */
    if (unicode_resize(p_left, new_len) != 0)
        goto error;

    /* copy 'right' into the newly allocated area of 'left' */
    _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
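Because the in-place path has to resize the string, the address is not guaranteed to stay the same on every append: the allocator may move the block when it grows, and other Python implementations may not perform this optimization at all. The sketch below (count_distinct_ids is an illustrative name) counts how many different ids show up in a loop like the one in the question. The exact count depends on the interpreter and the allocator, so it is printed without an expected value.

```python
def count_distinct_ids(n):
    """Append n digits to a string and count the distinct id() values seen."""
    s = "helloworld"
    seen = set()
    for i in range(n):
        s += str(i)
        seen.add(id(s))
    return s, len(seen)


s, distinct = count_distinct_ids(5)
print(s)         # helloworld01234
print(distinct)  # between 1 and 5, depending on the implementation
```

Either way, the resulting string value is the same, so code should not rely on id() staying fixed or changing across +=.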
Marco D.G.