
I came across this weird behaviour which happens only in an interactive Python session, but not when I write a script and execute it.

String is an immutable data type in Python, hence:

>>> s2='string'
>>> s1='string'
>>> s1 is s2
True

Now, the weird part:

>>> s1='a string'
>>> s2='a string'
>>> s1 is s2
False

I have seen that a space in the string causes this behaviour. If I put the same code in a script and run it, the result is True in both cases.

Would anyone have a clue about this? Thanks.

EDIT:

Okay, the above question and answers give some ideas. Now here is another experiment:

>>> s2='astringbstring'
>>> s1='astringbstring'
>>> s1 is s2
True

In this case the strings are definitely longer than 'a string', but they still have the same identity.

  • See this post http://stackoverflow.com/questions/2123925/when-does-python-allocate-new-memory-for-identical-strings – isedev Mar 03 '13 at 03:53
  • Be aware that the interning rules can vary across Python implementations and versions. Apart from the idiomatic `is [not] None` case, use of `is` is extremely rare in Python; you should only use it when you really are concerned with object identity rather than value equality. – Russell Borogove Mar 03 '13 at 07:33
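The distinction drawn in the comment above can be sketched as follows. The helper name `describe` is hypothetical, and whether two equal strings are the same object is an implementation detail, so only the `==` result should be relied on:

```python
def describe(a, b):
    """Return (equal_by_value, same_object) for two objects."""
    return (a == b, a is b)

# Build the second string at runtime so the compiler cannot share one constant.
s1 = "a string"
s2 = " ".join(["a", "string"])

equal, identical = describe(s1, s2)
print(equal)      # value equality: what you almost always want
print(identical)  # object identity: depends on the implementation

# The idiomatic use of `is`: comparing against the None singleton.
value = None
print(value is None)
```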

1 Answer


Many thanks to @eryksun for the corrections!

This is because of a mechanism called interning in Python. The Python 2 documentation for intern() describes it:

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.
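As a rough illustration of explicit interning: in Python 3 the built-in `intern()` quoted above is available as `sys.intern()`. The variable names below are illustrative only, and the strings are built at runtime on the assumption that CPython does not intern such strings by itself:

```python
import sys

# Build two equal strings at runtime.
a = " ".join(["a", "string"])
b = " ".join(["a", "string"])

print(a == b)  # equal by value

# sys.intern() returns the canonical copy from the table of interned strings,
# so both calls hand back the same object while a reference to it is kept.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

Keeping `ia` and `ib` around matters here, since the quoted note says interned strings are not immortal.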

CPython automatically interns certain strings (one-character strings, keywords, and string constants containing only ASCII letters, digits, and underscores) to speed up lookups and comparisons: e.g., when both operands are interned, comparing 'dog' with 'dog' can be settled by a pointer comparison instead of a full string comparison. Interning every string automatically would take a lot more memory, which is not always feasible, so other strings may not share the same identity, and id() can then return different results, e.g.:

# different id when no reference is kept: the object can be freed and created again
In [146]: id('dog')
Out[146]: 4380547672

In [147]: id('dog')
Out[147]: 4380547552

# if assigned, the interned string stays alive and is reused (though this depends on the implementation)
In [148]: a = 'dog'

In [149]: b = 'dog'

In [150]: id(a)
Out[150]: 4380547352

In [151]: id(b)
Out[151]: 4380547352

In [152]: a is b
Out[152]: True

For integers, at least on my machine, CPython caches small values up to 256 (its small-integer cache covers -5 through 256):

In [18]: id(256)
Out[18]: 140511109257408

In [19]: id(256)
Out[19]: 140511109257408

In [20]: id(257)
Out[20]: 140511112156576

In [21]: id(257)
Out[21]: 140511110188504
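A small sketch for probing this on your own interpreter. The helper name `same_object_after_parsing` is hypothetical; parsing the number from text at runtime is used so that the compiler cannot hand back one shared constant. The results for values outside the cache are an implementation detail and may differ between interpreters:

```python
def same_object_after_parsing(n):
    """Parse the same integer twice at runtime and report whether
    both results are one and the same object."""
    text = str(n)
    return int(text) is int(text)

for n in (-5, 0, 256, 257, 10**6):
    print(n, same_object_after_parsing(n))
```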

UPDATE thanks to @eryksun: in this case the string 'a string' is not interned because CPython only interns string constants made up entirely of name characters (ASCII letters, digits, and underscore), and a space is not one of them. It is not because of the length, as I initially assumed.
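The rule can be sketched in plain Python. This is only a model of the check described above, not CPython's actual code, and it assumes the name characters are exactly ASCII letters, digits, and underscore:

```python
import string

# Assumption: "name characters" are ASCII letters, digits, and underscore.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + "_")

def all_name_chars(s):
    """Return True if every character of s is a name character."""
    return all(ch in NAME_CHARS for ch in s)

for candidate in ("string", "a string", "astringbstring"):
    print(candidate, all_name_chars(candidate))
```

Under this model, 'string' and 'astringbstring' from the question qualify for interning, while 'a string' does not.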

For more details, you can also refer to Alex Martelli's answer here.
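The script-versus-interactive difference from the question can be explored with `compile()`. This sketch rests on an assumption that is a CPython implementation detail: equal constants inside one compiled code object are stored once, whereas an interactive session compiles each statement separately. The function names are hypothetical:

```python
SOURCE = "s1 = 'a string'\ns2 = 'a string'\n"

def run_as_one_unit(source):
    """Compile the whole source at once, as happens when a script is run."""
    namespace = {}
    exec(compile(source, "<script>", "exec"), namespace)
    return namespace["s1"], namespace["s2"]

def run_line_by_line(source):
    """Compile each line separately, roughly like an interactive session."""
    namespace = {}
    for line in source.splitlines():
        exec(compile(line, "<stdin>", "single"), namespace)
    return namespace["s1"], namespace["s2"]

a, b = run_as_one_unit(SOURCE)
print(a is b)  # one code object: the two constants can be shared

c, d = run_line_by_line(SOURCE)
print(c is d)  # separate code objects: sharing depends on interning
```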

K Z
  • @Amit: It's the space character in the string. A CPython code object will only intern string constants that are all "name characters" (ASCII letters, digits, and underscore). See CPython 2.7.3 [codeobject.c](http://hg.python.org/cpython/file/70274d53c1dd/Objects/codeobject.c#l71). – Eryk Sun Mar 03 '13 at 04:11
  • @eryksun thanks for the addition, I totally forgot spaces are not interned. – K Z Mar 03 '13 at 04:14
  • @eryksun Thanks very much. That was the missing piece indeed. –  Mar 03 '13 at 04:15
  • @KayZhu Can you please modify your answer considering eryksun's inputs, so that I can accept your answer? I cannot accept it in its present form, since you talk about "short" and "long" strings which is not the case. –  Mar 03 '13 at 04:18
  • @KayZhu: Your updated example is related to the peephole optimizer. See the function `fold_binops_on_constants` in [peephole.c](http://hg.python.org/cpython/file/70274d53c1dd/Python/peephole.c#l177). There the limit is 20, so `('X'*21 is 'X'*21) == False`. – Eryk Sun Mar 03 '13 at 04:36
  • @eryksun Thanks a lot for the info! I didn't know about the peephole optimizer, definitely learned something :) I've updated my answer with a better example. – K Z Mar 03 '13 at 07:28
  • @KayZhu: Just to clarify, it's not the shortness of `'abc'` that counts. Interned strings in CPython are not immortal. Try `'zzz'` for example, and look at the reference counts. abc is the [Abstract Base Classes](http://docs.python.org/2/library/abc) module. – Eryk Sun Mar 03 '13 at 11:23
  • @eryksun You really should have been the one answering this question :) It turned out that my understanding that interning varies between long and short strings was not accurate -- the only thing I could find related to string length is [here](http://hg.python.org/cpython/file/0e41c4466d58/Objects/stringobject.c#l72), and that's only for strings with 1 character. – K Z Mar 03 '13 at 12:07
  • The array of single-character string objects optimizes string subscription and iteration. It gets bypassed, however, if you use an escape such as `'\n'` or `'\x00'`, as opposed to `chr(10)` and `chr(0)`, so those will be new objects. Except, coming back to code objects, the name characters will be interned in the `interned` `dict`. Also, a bug in codeobject.c `all_name_chars` causes `'\x00'` to be interned as well. – Eryk Sun Mar 03 '13 at 18:33
  • @eryksun interesting, please feel free to edit the answer :) – K Z Mar 04 '13 at 08:10