8

This is for Python 2.6.

I could not figure out why a and b are identical:

>>> a = "some_string"
>>> b = "some_string"
>>> a is b
True

But if there is a space in the string, they are not:

>>> a = "some string"
>>> b = "some string"
>>> a is b
False

If this is normal behavior, could someone please explain what is going on?

Edit: Disclaimer! This is not being used to check for equality. I actually wanted to explain to someone else that "is" only checks identity, not equality. From the documentation I had understood that references created in this way would be different, because a new string is created each time. The very first example I gave threw me off when I couldn't prove my own point!

Edit: I understand that this is not a bug, and interning was a new concept for me. This seems to be a good explanation.

sarshad
  • see http://stackoverflow.com/questions/306313/python-is-operator-behaves-unexpectedly-with-integers – cobbal Nov 12 '10 at 14:39
  • "Bug?" Questions really irritate me.. but I guess that's not a reason to mark a question down... is it? – MattH Nov 12 '10 at 14:42
  • @cobbal: That's slightly different than what's happening here. – Ignacio Vazquez-Abrams Nov 12 '10 at 14:44
  • @Ignacio: Different underlying cause, though the same basic category: some immutable objects being cached and others not, based on some unspecified heuristic or another. – Glenn Maynard Nov 12 '10 at 14:48
  • one of many examples: [How is the 'is' keyword implemented in Python?](http://stackoverflow.com/questions/2987958/how-is-the-is-keyword-implemented-in-python) – Jochen Ritzel Nov 12 '10 at 15:00
  • Sounds like he's asking why they're not always the same reference or not, not about what `is` does, particularly given that he used the word *reference* in the title, which most people confusing `is` with equality wouldn't. – Glenn Maynard Nov 12 '10 at 15:04
  • @MattH: I understand now that it's not a bug, but I think it would be better to keep the question in case noobs like me think it is! – sarshad Nov 12 '10 at 18:37

3 Answers

10

Python may or may not automatically intern strings, which determines whether future instances of the string will share a reference.

If it decides to intern a string, then both will refer to the same string instance. If it doesn't, it'll create two separate strings that happen to have the same contents.

In general, you don't need to worry about whether this is happening or not; you usually want to check equality, a == b, not whether they're the same object, a is b.
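As a minimal sketch of the difference (the variable names are only illustrative, and `b` is built at runtime so the two strings are not simply the same literal), equality holds regardless of whether the interpreter happens to share the object:

```python
a = "some string"
# Build an equal string at runtime instead of writing the literal again.
b = "".join(["some", " ", "string"])

# Equality compares contents, so this is always True.
print(a == b)

# Identity asks whether both names refer to one object. The result
# depends on the implementation's caching, so do not rely on it.
print(a is b)
```

Only the `==` result is something code should depend on.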

Glenn Maynard
  • Just out of curiosity, do you know what the logic is for when Python interns strings? I thought it interned every string literal but apparently not... – Katriel Nov 12 '10 at 14:48
  • @katrielalex: Strings from a .py file are a bit different: they're stored in a list of constants which are referred to by index in bytecode. That means that if you use the same string constant in two places in one .py file, it's almost guaranteed that they'll be the same object. That's different than interning. However, the OP isn't running out of a .py file, he's running from the Python interpreter; that means they're parsed as separate "files", and don't have a shared string table. – Glenn Maynard Nov 12 '10 at 14:58
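The constant-table point in the last comment can be sketched roughly as follows, assuming CPython. The source text and the `<example>` filename are made up for illustration; `compile` stands in for loading a single .py file:

```python
# One compilation unit that uses the same literal twice.
source = 'a = "some string"\nb = "some string"\n'
code = compile(source, "<example>", "exec")

# On CPython the literal is typically stored once in the constants tuple.
print(code.co_consts.count("some string"))

namespace = {}
exec(code, namespace)

# Both names were loaded from the same constant, so they share one object.
print(namespace["a"] is namespace["b"])
```

Each line typed at the interactive prompt is compiled on its own, so it gets its own constants, which is consistent with the behavior in the question.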
1

Tim Peters said: Sorry, the only bug I see here is in the code you posted using "is" to try to determine whether two strings are equal. "is" tests for object identity, not for equality, and whether two immutable objects are really the same object isn't in general defined by Python. You should use "==" to check two strings for equality. The only time it's reliable to use "is" for that purpose is when you've explicitly interned all strings being compared (via using the intern() builtin function).

from here: http://mail.python.org/pipermail/python-bugs-list/2004-December/026772.html
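A small sketch of explicit interning, assuming CPython and plain `str` objects. In Python 2 `intern()` is a builtin; in Python 3 it lives in `sys`, so the import below covers both:

```python
try:
    from sys import intern  # Python 3 moved intern() into sys
except ImportError:
    pass  # Python 2: intern() is already a builtin

a = intern("some string")
# Build an equal string at runtime, then intern it too.
b = intern("".join(["some", " ", "string"]))

# On CPython both calls return the single interned copy.
print(a is b)
```

Other implementations are free to handle interning differently, so `==` remains the portable way to compare strings.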

Ant
  • -1: for immutable objects like strings you would expect equal references, so this is not a bug. – Falcon Nov 12 '10 at 14:49
  • @Falcon: No. That strings are immutable *allows* (efficient) interning (= identical references for identical objects), but it doesn't require it. –  Nov 12 '10 at 14:57
  • @Falcon: Well you'd be mistaken in that expectation. Checking equality between native python code and python-wrapped c-library, for example: openssl library. That's even if you didn't know that not all strings are interned. Shame I can't mark you down. – MattH Nov 12 '10 at 14:58
  • It's never reliable to use `is` for this. `intern` may be a no-op: a Python implementation may choose not to use string interning (making intern simply return its parameter), and it may also choose to ignore requests to intern a string. For example, CPython silently ignores requests to intern subclasses of `string`. – Glenn Maynard Nov 12 '10 at 15:08
0

This should really be a comment on Glenn's answer, but I cannot comment yet. I ran some tests directly in the Python interpreter and saw some interesting behavior. According to Glenn, the interpreter treats entries as separate "files" that don't share a string table when stored for future reference. Here is what I ran:

>>> a="some_string"
>>> b="some_string"
>>> id(a)
2146597048
>>> id(b)
2146597048
>>> a="some string"
>>> b="some string"
>>> id(a)
2146597128
>>> id(b)
2146597088
>>> c="some string"   <-----(1)
>>> d="some string"   
>>> id(c)
2146597208            <-----(1)
>>> a="some_string"
>>> b="some_string"
>>> id(a)
2146597248            <---- waited a few minutes
>>> c="some_string" 
>>> d="some_string"
>>> id(d)
2146597248            <---- still same id after a few min
>>> b="some string"
>>> id(b)
2146597288
>>> b="some_string"  <---(2)
>>> id(b)
2146597248           <---(2)
>>> a="some"
>>> b="some"
>>> c="some"
>>> d="some"         <---(2) lost all references
>>> id(a)
2146601728
>>> a="some_string"  <---(2)
>>> id(a)
2146597248           <---(2) returns same old one after mere seconds
>>> a="some"
>>> id(a)
2146601728           <---(2) Waited a few minutes
>>> a="some_string"   <---- (1)
>>> id(a)
2146597208            <---- (1) Reused a "different" id after a few minutes

It seems that some of the ids might be reused after the initial references are lost and no longer "in use" (1), but it might also be related to how long those ids go unused, as you can see in what I have marked as (2), where a different id comes back depending on how long the old one has been out of use. I just find it curious and thought I'd post it.
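One possible reading of these numbers, assuming CPython: `id()` is only guaranteed to be unique among objects that exist at the same time, and in CPython it is the object's memory address, so a freed string's address can be handed to a later one. The helper `make` below is hypothetical and just builds a fresh string on each call:

```python
def make():
    # Build a new string object at runtime on every call.
    return "".join(["some", " ", "string"])

# Objects that are alive at the same time always have distinct ids.
live = [make() for _ in range(3)]
ids = [id(s) for s in live]
print(len(set(ids)) == len(live))

# Each temporary here is freed as soon as id() returns, so the second
# string may land at the same address. This is not guaranteed.
first_id = id(make())
second_id = id(make())
print(first_id == second_id)
```

Under that reading, which id comes back depends on what the allocator has free at that moment rather than on elapsed time as such.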

Luis Herrera