
While this question doesn't have any real use in practice, I am curious as to how Python does string interning. I have noticed the following.

>>> "string" is "string"
True

This is as I expected.

You can also do this.

>>> "strin"+"g" is "string"
True

And that's pretty clever!

But you can't do this.

>>> s1 = "strin"
>>> s2 = "string"
>>> s1+"g" is s2
False

Why wouldn't Python evaluate s1+"g", and realize it is the same as s2 and point it to the same address? What is actually going on in that last block to have it return False?

wjandrea
Ze'ev G

2 Answers


This is implementation-specific, but your interpreter is probably interning compile-time constants but not the results of run-time expressions.

The examples below use CPython 3.9.0+.

In the second example, the expression "strin"+"g" is evaluated at compile time, and is replaced with "string". This makes the first two examples behave the same.

If we examine the bytecodes, we'll see that they are exactly the same:

  # s1 = "string"
  1           0 LOAD_CONST               0 ('string')
              2 STORE_NAME               0 (s1)

  # s2 = "strin" + "g"
  2           4 LOAD_CONST               0 ('string')
              6 STORE_NAME               1 (s2)

This bytecode was obtained with the following code (which also prints a few more lines after those shown above):

import dis

source = 's1 = "string"\ns2 = "strin" + "g"'
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the disassembly itself and returns None
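The folding can also be checked without reading bytecode, by looking at the constants the compiler stored in the code object. This is only a sketch and assumes CPython; the filename `"<example>"` and the variable names are made up for illustration, and other implementations may behave differently.

```python
# Sketch (CPython assumed): inspect the constants stored with the bytecode.
# "<example>" is just a placeholder filename for compile().
folded = compile('s2 = "strin" + "g"', "<example>", "exec")
runtime = compile('s1 = "strin"\ns3 = s1 + "g"', "<example>", "exec")

# co_consts holds the constants that LOAD_CONST refers to.
print(folded.co_consts)
print(runtime.co_consts)

# The folded expression leaves a single 'string' constant behind,
# while the run-time concatenation only has its two pieces available.
has_folded_constant = "string" in folded.co_consts
has_runtime_constant = "string" in runtime.co_consts
```

If the compiler folded the expression, `has_folded_constant` is true, while the version that goes through a variable only stores `'strin'` and `'g'`.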

The third example involves a run-time concatenation, the result of which is not automatically interned:

  # s3a = "strin"
  3           8 LOAD_CONST               1 ('strin')
             10 STORE_NAME               2 (s3a)

  # s3 = s3a + "g"
  4          12 LOAD_NAME                2 (s3a)
             14 LOAD_CONST               2 ('g')
             16 BINARY_ADD
             18 STORE_NAME               3 (s3)
             20 LOAD_CONST               3 (None)
             22 RETURN_VALUE

This bytecode was obtained with the following code (which first prints a few more lines, identical to the first block of bytecode given above):

import dis

source = (
    's1 = "string"\n'
    's2 = "strin" + "g"\n'
    's3a = "strin"\n'
    's3 = s3a + "g"')
code = compile(source, '', 'exec')
dis.dis(code)  # dis.dis() prints the disassembly itself and returns None

If you were to manually sys.intern() the result of the third expression, you'd get the same object as before:

>>> import sys
>>> s3a = "strin"
>>> s3 = s3a + "g"
>>> s3 is "string"
False
>>> sys.intern(s3) is "string"
True

Also, Python 3.8 and later emit a warning for the last two comparisons above:

SyntaxWarning: "is" with a literal. Did you mean "=="?
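The same comparison can be written as a script that avoids `is` with a literal, so it does not trigger the warning. This is a minimal sketch; `build_at_runtime` is a hypothetical helper used only to make sure the concatenation happens at run time.

```python
import sys


def build_at_runtime(prefix, suffix):
    """Concatenate at run time so the compiler cannot fold the result."""
    return prefix + suffix


literal = "string"
built = build_at_runtime("strin", "g")

# Value equality always holds for equal strings.
equal_by_value = built == literal

# sys.intern() returns the one canonical object for a given string value,
# so interning both names must yield the same object.
same_after_intern = sys.intern(built) is sys.intern(literal)

print(equal_by_value, same_after_intern)  # True True
```

Whether `built is literal` holds before interning is an implementation detail (on CPython it is typically `False`), which is why `==` is the right operator for comparing string values.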

0 _
NPE
  • And for the record: Python's peep-hole optimisation will pre-calculate arithmetic operations on constants (`"string1" + "s2"`, `10 + 3*20`, etc.) at compile time, but limits resulting *sequences* to just 20 elements (to prevent `[None] * 10**1000` from overly expanding your bytecode). It is this optimisation that collapsed `"strin" + "g"` into `"string"`; the result is shorter than 20 characters. – Martijn Pieters Jul 10 '13 at 09:12
  • And to make it doubly clear: there is not interning going on here at all. Immutable literals are instead stored as constants with the bytecode. Interning *does* take place for names used in code, but not for string values created by the program unless specifically interned by the `intern()` function. – Martijn Pieters Feb 12 '14 at 11:23
  • For those who are trying to find the `intern` function in Python 3: it has been moved to [sys.intern](https://docs.python.org/3/library/sys.html?highlight=sys#sys.intern) – Timofey Chernousov Jul 21 '17 at 08:55

Case 1

>>> x = "123"
>>> y = "123"
>>> x == y
True
>>> x is y
True
>>> id(x)
50986112
>>> id(y)
50986112

Case 2

>>> x = "12"
>>> y = "123"
>>> x = x + "3"
>>> x is y
False
>>> x == y
True

Now, your question is why the id is the same in case 1 but not in case 2.
In case 1, you have assigned the string literal "123" to both x and y.

Since strings are immutable, it makes sense for the interpreter to store the string literal only once and point all the variables to the same object.
Hence you see identical ids.

In case 2, you are rebinding x to the result of a concatenation performed at run time. x and y have the same value, but not the same identity.
They point to different objects in memory, so they have different ids and the is operator returns False.

cppcoder
  • How come, since strings are immutable, assigning x+"3" (and looking for a new spot to store the string) doesn't assign to the same reference as y? – nicecatch Aug 09 '16 at 21:14
  • Because then it needs to compare the new string with all existing strings; potentially a very expensive operation. It could do this in the background after assignment I suppose, to reduce memory, but then you would end up with even stranger behaviour: `id(x) != id(x)` for instance, because the string was moved in the process of evaluation. – DylanYoung Sep 23 '17 at 19:55
  • @AndreaConte Because string concatenation doesn't do the extra work of looking up the pool of all existing strings each time it generates a new one. On the other hand, the interpreter "optimizes" the expression `x = "12" + "3"` into `x = "123"` (concatenation of two string literals in a single expression), so the assignment actually does the lookup and finds the same "internal" string as for `y = "123"`. – derenio Sep 25 '17 at 08:38
  • Actually, it isn't that the assignment does the lookup; rather, every string literal from the source code gets "internalized" and that object gets reused in all the other places. – derenio Sep 25 '17 at 08:42