I am unable to understand the following behaviour. I create two strings and compare them with the `is` operator. In the first case it behaves differently from what I expect; in the second case it works as expected. Why does comparing with `is` give False when the string contains a comma or a space, but True when it contains no comma, space, or other such characters?

Python 3.6.5 (default, Mar 30 2018, 06:41:53) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'string'
>>> b = a
>>> b is a
True
>>> b = 'string'
>>> b is a
True
>>> a = '1,2,3,4'
>>> b = a
>>> b is a
True
>>> b = '1,2,3,4'
>>> b is a
False

Is there any reliable information on why Python treats these strings differently? I understand that initially a and b refer to the same object. Then b gets a new object, yet b is a still says True. This behaviour is a little confusing.

With 'string' it produces the same result both times. What goes wrong with '1,2,3,4'? Both are strings. What is the difference between case 1 and case 2, i.e. why does the is operator produce different results for different string contents?

Sibidharan
  • I understand how == works. I am not referring to == here. This is entirely about is. Check my code: after the reference changes, it still produces the same result. – Sibidharan Apr 26 '18 at 07:53
  • 2
    @deceze This question is not about identity and value check it's about how differently Python caches strings (string interning). – Mazdak Apr 26 '18 at 07:53
  • 1
    @khelwood this looks like a subtly different question. Why does the `is` operator change result when the same operations are performed but with different contents of the strings. – FHTMitchell Apr 26 '18 at 07:53
  • 1
    Yes there is something to this question. If you just have `a = '1,'`, then `b = '1,'` then they are apparently different objects. There is something arcane going on related to which strings are interned. – khelwood Apr 26 '18 at 07:55
  • 1
    Also see [About the changing id of an immutable string](//stackoverflow.com/q/24245324) – Martijn Pieters Apr 26 '18 at 07:56
  • @MartijnPieters Great explanation – khelwood Apr 26 '18 at 07:58
  • @khelwood: now try `a = '1,'; b = '1,'`, or put those two assignments in a function and have the function return the two. Etc. String interning and code object constants and any number of such optimisations can lead to re-use of the same object for different references, but that doesn't mean you should count on this and use `is` over `==` – Martijn Pieters Apr 26 '18 at 07:58
  • Plenty of duplicates that touch on this… https://stackoverflow.com/q/1392433/476, https://stackoverflow.com/q/1504717/476, https://stackoverflow.com/q/2858603/476, https://stackoverflow.com/q/2987958/476 – deceze Apr 26 '18 at 07:58
  • @deceze: that first one is also using comparison chaining, fun! The expression *in the first line* produces `False`, not `True`. – Martijn Pieters Apr 26 '18 at 07:59
  • Python 2 blog, but still relevant http://guilload.com/python-string-interning/ – Chris_Rands Apr 26 '18 at 08:05
  • this question is probably a duplicate. But since there are 23152 possible targets, which one to choose? – Jean-François Fabre Apr 26 '18 at 13:39

1 Answer

An important point about this behavior is that Python caches some strings, mostly short ones (usually fewer than 20 characters, though not every combination of characters), so that they can be accessed quickly. One important reason is that strings are used heavily inside Python's own implementation, and caching certain kinds of strings is an internal optimization. Dictionaries are among the most widely used data structures in Python's implementation: they hold variables, attributes, and namespaces in general, among other purposes, and they all use strings as keys for the object names. This means that when you access an object attribute or a variable (a global, for example), a dictionary lookup typically happens internally.
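To make the "namespaces are dictionaries keyed by strings" point concrete, here is a small sketch. The class `Point` and its attributes are made up for illustration; the only thing it relies on is the built-in `vars()`, which returns an object's namespace mapping.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)

# Instance attributes live in a dict keyed by their names (strings).
print(vars(p))                    # {'x': 1, 'y': 2}

# Attribute access by name and a direct dict lookup find the same value.
print(p.x == vars(p)['x'])        # True

# Methods are found the same way, in the class's namespace mapping.
print('__init__' in vars(Point))  # True
```

Because names like 'x' and '__init__' are looked up this often, it pays for the interpreter to keep a single shared copy of each.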

Now, the reason you got such bizarre behavior is that Python (the CPython implementation) does not treat all strings the same way when it comes to interning. In CPython's source code there is an intern_string_constants function that decides which string constants get interned; you can check it for more details. Or check this comprehensive article: http://guilload.com/python-string-interning/.
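As a rough sketch of that check, assuming the rule in the CPython 3.6 source is "intern a string constant only if it consists solely of ASCII letters, digits, and underscores": the helper name `looks_like_name` is hypothetical, and it only predicts which literals are candidates for interning. It says nothing about other caches (single-character strings, constants shared within one code object), and other CPython versions may apply a different rule.

```python
import string

# Assumption: the character set CPython 3.6 accepts for automatic interning
# of constants (ASCII letters, digits, underscore). Not a documented guarantee.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + '_')

def looks_like_name(s):
    """Hypothetical helper: True if every character of s is a name character."""
    return all(ch in NAME_CHARS for ch in s)

print(looks_like_name('string'))   # True
print(looks_like_name('1,2,3,4'))  # False
print(looks_like_name('a,,'))      # False
```

This matches the question: 'string' passes the check, while '1,2,3,4' fails it because of the commas.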

It's also noteworthy that Python has an intern() function in the sys module that you can use to intern strings manually.

In [51]: import sys

In [52]: b = sys.intern('a,,')

In [53]: c = sys.intern('a,,')

In [54]: b is c
Out[54]: True

You can use this function either when you want to speed up dictionary lookups or when you need to use a particular string object frequently in your code.
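A minimal sketch of manual interning with strings built at run time, so that no compile-time constant handling is involved. The variable names are arbitrary; `sys.intern` is the only API used.

```python
import sys

# Build two equal strings at run time so they are not compile-time constants.
parts = ['1', '2', '3', '4']
a = ','.join(parts)
b = ','.join(parts)

print(a == b)   # True: same value

# After interning, both names refer to the single interned object.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)   # True
```

Keep a reference to the interned string for as long as you need the identity to hold; interned strings are not immortal.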

Another point that you should not confuse with string interning is that when you do b = a, you're creating two references to the same object, so it is obvious that both names have the same id.
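Plain assignment is enough to show this, and no interning is involved:

```python
a = '1,2,3,4'
b = a                  # no new string is created; b is a second name for the same object

print(b is a)          # True
print(id(b) == id(a))  # True
```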

Regarding punctuation, it seems that a string consisting of a single punctuation character gets interned, but if its length is more than one it won't get cached. As mentioned in the comments, one reason might be that keywords and dictionary keys are less likely to contain punctuation.

In [28]: a = ','

In [29]: ',' is a
Out[29]: True

In [30]: a = 'abc,'

In [31]: 'abc,' is a
Out[31]: False

In [34]: a = ',,'

In [35]: ',,' is a
Out[35]: False

# Or

In [36]: a = '^'

In [37]: '^' is a
Out[37]: True

In [38]: a = '^%'

In [39]: '^%' is a
Out[39]: False

But still, these are just observations about implementation details that you cannot rely on in your code.

Wouter
Mazdak
  • I understood that python creates a pool of strings to make them more accessible but what is the difference between `1` and `1,` that causes them to have different ids? – BcK Apr 26 '18 at 08:04
  • Moreover I realized that if I use multiple assignment like `a = b = c = '1,'`, all of them have the same id. Link for this statement : https://stackoverflow.com/questions/35275026/python3-multiple-assignment-and-memory-address – BcK Apr 26 '18 at 08:08
  • Look at the source code. Crucial point: We're not talking about Python here, but one implementation of Python (CPython). PyPy might behave differently. – Matthias Apr 26 '18 at 08:09
  • @BcK As I mentioned, that's a CPython implementation detail; you can check the source for that. Regarding the assignments, please check the update. – Mazdak Apr 26 '18 at 08:11
  • @Kasramvd To be succinct, does that mean CPython does not include punctuation or possibly any other characters when caching a short string? – BcK Apr 26 '18 at 08:16
  • 1
    @BcK Check out the update. – Mazdak Apr 26 '18 at 08:23
  • @Kasramvd That answered my question, thank you for the detailed explanation. – Sibidharan Apr 26 '18 at 08:25
  • @Kasramvd Thanks, in the end `is` operator is tricky to use I guess, avoiding might be a better option. – BcK Apr 26 '18 at 08:27
  • @BcK No, before you decide whether to avoid it you should know how it behaves and see whether an identity check is necessary in your case. When you use a high-level language like Python with so much abstraction, you always need to be aware of this kind of unexpected behavior. – Mazdak Apr 26 '18 at 08:43
  • string interning doesn't intern strings containing some non-alphanums, because they're less likely to be reused as keys/class/type names. – Jean-François Fabre Apr 26 '18 at 13:41
  • @Jean-FrançoisFabre Yeah that's very likely. I was thinking about that while writing this answer, don't know why I forgot to mention it. Thanks for the suggestion tho. – Mazdak Apr 26 '18 at 17:11