I'm completely stuck with this

>>> s = chr(8263)
>>> x = s[0]
>>> x is s[0]
False

How is this possible? Does this mean that accessing a string character by indexing creates a new instance of the same character? Let's experiment:

>>> L = [s[0] for _ in range(1000)]
>>> len(set(L))
1
>>> ids = map(id, L)
>>> len(set(ids))
1000
>>>

Yikes what a waste of bytes ;) Or does it mean that str.__getitem__ has a hidden feature? Can somebody explain?

But this is not the end of my surprise:

>>> s = chr(8263)
>>> t = s
>>> print(t is s, id(t) == id(s))
True True

This is clear: t is an alias for s, so they represent the same object and their identities coincide. But again, how is the following possible:

>>> print(t[0] is s[0])
False

s and t are the same object, so what is going on?

But worse:

>>> print(id(t[0]) == id(s[0]))
True

t[0] and s[0] have not been garbage collected, are considered different objects by the is operator, but have the same id? Can somebody explain?

P. Ortiz

2 Answers

7

There are two points to make here.

First, Python does indeed create a new character with the __getitem__ call, but only if that character has an ordinal value of 256 or greater.

For example:

>>> s = chr(255)
>>> s[0] is s
True

>>> t = chr(257)
>>> t[0] is t
False

This is because internally, the compiled getitem function checks the ordinal value of the single character and calls get_latin1_char if that value is less than 256. This allows some single-character strings to be shared. Otherwise, a new unicode object is created.
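If you want to check the sharing yourself, here is a minimal sketch. It assumes CPython (sharing of low-ordinal characters is an implementation detail, not a language guarantee), and the helper name `index_returns_shared_object` is made up for this illustration.

```python
def index_returns_shared_object(ch):
    """Return True if indexing a one-character string twice yields one object.

    Both results are bound to names, so both objects are alive when
    the `is` comparison runs.
    """
    first = ch[0]
    second = ch[0]
    return first is second


# Sample characters chosen for illustration: "A" lies inside the Latin-1
# range, chr(8263) (the character from the question) lies far outside it.
cached = index_returns_shared_object("A")
not_cached = index_returns_shared_object(chr(8263))
print(cached, not_cached)
```

On CPython this should print True False; other implementations are free to behave differently.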

The second issue concerns garbage collection and shows that the interpreter can reuse memory addresses very quickly. When you write:

>>> s = t # = chr(257)
>>> t[0] is s[0]
False

Python first creates two new single character strings and then compares their memory addresses. These have different addresses (we have different objects as per the explanation above) so comparing the objects with is returns False.

On the other hand, we can have the seemingly paradoxical situation that:

>>> id(t[0]) == id(s[0])
True

But this is because the interpreter quickly reuses the memory address of t[0] when it creates the new string s[0] at a later moment in time.

If you examine the bytecode this line produces (e.g. with dis - see below), you see that the addresses for each side are allocated one after the other (a new string object is created and then id is called on it).

The references to the object t[0] drop to zero as soon as id(t[0]) is returned (we're doing the comparison on integers now, not the object itself). This means that s[0] can reuse the same memory address when it is created afterwards.
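The difference between comparing temporaries and comparing live objects can be sketched as follows. The variable names are only for illustration, and the address reuse in the first comparison is typical on CPython but never guaranteed, so the code does not rely on it.

```python
s = chr(8263)
t = s

# Temporaries: each one-character string can be freed as soon as id()
# returns, so the second one may be allocated at the same address.
# Treat this result as informational only.
temporaries_share_id = id(t[0]) == id(s[0])

# Named references keep both objects alive at the same time.
a = t[0]
b = s[0]

# For objects that are alive simultaneously, equal ids and `is` always agree.
ids_agree_with_is = (id(a) == id(b)) == (a is b)

print(temporaries_share_id, ids_agree_with_is)
```

The point is that id values are only meaningful for comparison while both objects still exist.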


Here is the disassembled bytecode for the line id(t[0]) == id(s[0]) which I've annotated.

You can see that the lifetime of t[0] ends before s[0] is created (there are no references to it) hence its memory can be reused.

  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_GLOBAL              1 (t)
              6 LOAD_CONST               1 (0)
              9 BINARY_SUBSCR                     # t[0] is created
             10 CALL_FUNCTION            1        # id(t[0]) is computed...
                                                  # ...lifetime of string t[0] over
             13 LOAD_GLOBAL              0 (id)
             16 LOAD_GLOBAL              2 (s)
             19 LOAD_CONST               1 (0)
             22 BINARY_SUBSCR                     # s[0] is created...
                                                  # ...free to reuse t[0] memory
             23 CALL_FUNCTION            1        # id(s[0]) is computed
             26 COMPARE_OP               2 (==)   # the two ids are compared
             29 RETURN_VALUE
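
To reproduce a listing like this one, something along these lines should work. It is a sketch using the standard dis module; the filename string "<example>" is arbitrary, and opcode names and offsets differ between Python versions, so your output will probably not match the listing exactly.

```python
import dis

# Compile the expression without running it; t and s need not exist yet.
code = compile("id(t[0]) == id(s[0])", "<example>", "eval")

# Collect the opcode names in the order they appear in the bytecode.
opnames = [instr.opname for instr in dis.get_instructions(code)]

# Print the full disassembly.
dis.dis(code)
```

Whatever the version, the call on the t[0] side should appear before the subscript on the s[0] side, which is what lets the first temporary be freed early.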
Alex Riley
  • Thanks for your informative and convincing response. But it raises one question: doesn't it contradict the immutability of strings? By definition, immutability of a string `s` means that the `s[valid_index]` object never changes during the string's lifetime? – P. Ortiz Oct 27 '15 at 16:19
  • @P.Ortiz: no problem, glad it helped. If I understood you correctly, then this doesn't contradict the immutability of string objects as far as I can see. For the first part, it seems that `s[i]` just checks whether the i-th character of `s` is in the 0-255 range and, if it is, returns the shared singleton string object for that character. For the comparing `id` part, the lifetime of `t[0]` is over as far as Python is concerned and so `s[0]` is free to occupy the memory it was in. No string objects are mutated. – Alex Riley Oct 27 '15 at 16:37
  • I'll try to be clearer: the Python Ref. docs explain: "when we talk about the mutability of a container, only the *identities* of the immediately contained objects are implied." So now, suppose you have the following string: `s=chr(8263)`. If you perform indexing on `s`, for instance an innocent assignment such as `x=s[0]`, the *identity* of the first item in `s` has *changed* (`str.__getitem__` returns a new value as explained in your response). What I mean is that `str.__getitem__` is a mutating method (string state has been altered). But an immutable object doesn't have any mutating methods. – P. Ortiz Oct 27 '15 at 17:03
  • Writing `x = s[0]` doesn't mutate or alter `s` at all. If `s` is just the single character given by `chr(8263)` then `x = s[0]` creates a new string object in memory and assigns the name `x` to it. So now `x` is a representation of the same unicode character as `s`, but it exists at a new address in memory (it has a different identity). The string object `s` has not been changed in any way. – Alex Riley Oct 27 '15 at 17:19
  • Indeed, as you said: *it has a different identity*, and that's the problem for me (recall what the Python docs explain: "when we talk about the mutability of a container, only the **identities** of the immediately contained objects are implied"; emphasis is mine). – P. Ortiz Oct 27 '15 at 17:26
  • That quote is certainly true but makes more sense for containers like tuples, rather than strings. Tuples themselves are immutable but might contain mutable objects such as lists; you can change the list that the tuple contains. Strings are also immutable, but they are not containers in this same sense; they do not contain other Python objects. `s[0]` is not returning a character contained in `s`, it is reading the first code point of `s` and creating a whole new string object that is equal to it. – Alex Riley Oct 27 '15 at 20:45
0

is compares identities and == compares values. Check this doc:

Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The ‘is‘ operator compares the identity of two objects; the id() function returns an integer representing its identity (currently implemented as its address). An object’s type is also unchangeable.
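
A small sketch of the distinction, using lists because each list display is guaranteed to build a new object (the variable names are just for illustration):

```python
a = [1, 2, 3]
b = [1, 2, 3]  # equal value, but a separate object
c = a          # a second name for the same object

same_value = a == b               # values are compared
same_object = a is b              # identities are compared
alias_is_same = c is a            # c and a name one object
alias_ids_match = id(c) == id(a)  # same object, same id

print(same_value, same_object, alias_is_same, alias_ids_match)
```

Here a and b compare equal with == but are not the same object, while c and a are one object under two names.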

Shapi