31

I've started learning Python (Python 3.3) and was trying out the is operator. I tried this:

>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
>>> c = 'isitthespace'
>>> d = 'isitthespace'
>>> c is d
True
>>> e = 'isitthespace?'
>>> f = 'isitthespace?'
>>> e is f
False

It seems like the space and the question mark make is behave differently. What's going on?

EDIT: I know I should be using ==, I just wanted to know why is behaves like this.

Elazar
luisdaniel
  • 8
    For the record you should be using `==` to compare any item for equality but this is an interesting question nonetheless – jamylak May 26 '13 at 06:10
  • Probably some kind of string interning is causing `a is b` (noticing the string constant assigned to `b` has already been created and re-using it). The interning rule must care about spaces (or possibly length) – Ben Jackson May 26 '13 at 06:21
  • 1
    Hmm... I get different results when running from a file instead of typing into the interpreter. [The same in ideone.](http://ideone.com/o1Kovp) – awesoon May 26 '13 at 06:22
  • 1
    For whatever reason `id('ab')` consistently returns the same value in my shell while `id('a ')` consistently changes. I still have no idea why letters would have different behavior, but it's interesting to observe. Perhaps Python makes some kind of optimization by assuming that strings will often contain letters? I don't think that would make much sense but it's hard to explain this behavior. This is an interesting question. – Nolen Royalty May 26 '13 at 06:34
  • I would still like to see a definitive answer to this regarding CPython – jamylak May 26 '13 at 08:23
  • As you already know about what `is` really does, maybe [this question](http://stackoverflow.com/questions/10622472/when-does-python-choose-to-intern-a-string) would be helpful - if it contained a useful answer. – glglgl May 26 '13 at 08:40
  • read this http://stackoverflow.com/questions/11476190/why-0-6-is-6-false – Dmitry Zagorulkin May 26 '13 at 10:23

6 Answers

26

Warning: this answer is about the implementation details of a specific Python interpreter. Using is to compare strings is a bad idea.

Well, at least for CPython 3.4/2.7.3, the answer is "no, it is not the whitespace". Or rather, not only the whitespace:

  • Two string literals will share memory if they are either alphanumeric or reside in the same block (file, function, class or single interpreter command).

  • An expression that evaluates to a string will result in an object that is identical to the one created using a string literal, if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.

  • Single characters are unique.

Examples

Alphanumeric string literals always share memory:

>>> x='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> y='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> x is y
True

Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:

(interpreter)

>>> x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a';
>>> z='`!@#$%^&*() \][=-. >:"?<a';
>>> x is y
True 
>>> x is z
False 

(file)

x='`!@#$%^&*() \][=-. >:"?<a';
y='`!@#$%^&*() \][=-. >:"?<a';
z=(lambda : '`!@#$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)

Output: True and False

For simple binary operations, the compiler does very simple constant propagation (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 characters. If this is the case, the rules mentioned earlier are in force:

>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaa' + 'aaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
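
Instead of inferring what the compiler folded from the result of is, one can look at the constants stored in the compiled code object. The following is only a sketch: folded_constants is a hypothetical helper name, and the exact folding limits differ between CPython versions, so the 21-character boundary described above should not be expected everywhere.

```python
def folded_constants(source):
    """Return the string constants stored in the code object compiled from source."""
    code = compile(source, "<example>", "exec")
    return [c for c in code.co_consts if isinstance(c, str)]

# Literal operands only: the compiler can compute the result ahead of time,
# so the finished string shows up among the constants.
print(folded_constants("x = 'a' * 3 + 'b'"))

# A variable operand: the multiplication is left for run time,
# so only the short literal 'a' is stored.
print(folded_constants("t = 3\nx = 'a' * t"))
```

This matches the dis output quoted in the comments below: a folded expression becomes a single LOAD_CONST, while an unfolded one keeps its operands and performs the operation at run time.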

Single characters always share memory, of course:

>>> chr(0x20) is ' '
True
Elazar
  • 1
    Probably then the strings are not interned as I thought before, but taken - as string literals - from the same pool of strings inside a module. – glglgl May 26 '13 at 08:57
  • 1
    It is possible that when you hard-code `'a'*20` (or addition of string literals) the interpreter makes an optimizing decision and replaces it with the resulting string `'aaaaaaaaaaaaaaaaaaaa'`, so no string manipulation is done at runtime (faster execution, but larger compiled code); but when the multiplication is too big the optimization does not activate, in order to keep the code size small. – Iftah May 26 '13 at 12:03
  • 1
    @Iftah `dis.dis(lambda: 'x'*20)` results in `LOAD_CONST 3 ('xxxxxxxxxxxxxxxxxxxx')`, `dis.dis(lambda: 'aaaaa' + "aaa")` in `LOAD_CONST 3 ('aaaaaaaa')`. I could imagine that a method call (e.g. `.join()`) is "too complicated" in that sense. – glglgl May 26 '13 at 20:23
  • 1
    BTW, `dis.dis(lambda: 'x'*21)` leads to `LOAD_CONST 1 ('x')` `LOAD_CONST 2 (21)` `BINARY_MULTIPLY` – glglgl May 26 '13 at 20:25
  • Seems like the compiler is doing simple constant propagation, including '*' or '+' if the result is less than 21 characters. – Elazar May 26 '13 at 20:31
  • What you show is only how this specific implementation behaves. In other words, it is only an implementation detail, an optimization, and it should not be considered a language feature. Think about the situation when the algorithm is to be distributed. The operator `is` should not be misused only because of the features of the implementation. – pepr May 26 '13 at 20:31
  • @pepr, we've been through this already. The implementation details are what is interesting here. We do not try to teach how to use python in the correct way right now. – Elazar May 26 '13 at 20:34
  • @Elazar: I see. Then, isn't it better to look inside the implementation? – pepr May 26 '13 at 20:39
  • @pepr you are more than welcome to look inside it and tell us. I found it easier for me to knock on the walls; but by all means - go ahead. Looking inside the implementation is indeed the right way to do it. – Elazar May 26 '13 at 20:42
  • +1 great answer, In case someone is using **Ipython shell** then ``x='`!@#$%^&*() \][=-. >:"? – Ashwini Chaudhary May 26 '13 at 21:35
  • @Elazar: Great answer +1 :) Where did you find this information? In which source file is this implemented? – aldeb Apr 20 '15 at 19:19
  • Thanks. There's a link in the answer. But generally I just played with the REPL. – Elazar Apr 20 '15 at 20:17
16

To expand on Ignacio’s answer a bit: The is operator is the identity operator. It is used to compare object identity. If you construct two objects with the same contents, an identity comparison between them usually does not yield true. It works for some small strings because CPython, the reference implementation of Python, stores such strings only once, so that all those references point to the same string object. So the is operator returns true for those.

This, however, is an implementation detail of CPython and is not guaranteed for CPython or for any other implementation. So relying on it is a bad idea, as it can break at any time.

To compare strings, you use the == operator, which compares the equality of objects. Two string objects are considered equal when they contain the same characters. So this is the correct operator to use when comparing strings, and is should generally be avoided if you do not explicitly want object identity (example: a is False).
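
A short illustration of the difference, as a sketch only. The strings are built at run time so that no compile-time sharing can interfere, and find_user is a hypothetical function used just to show an identity test against None, one case where identity really is what is wanted.

```python
# Build two equal strings at run time.
parts = ["is it", " the space?"]
a = "".join(parts)
b = "".join(parts)

print(a == b)   # equality: compares the characters
print(a is b)   # identity: compares the objects themselves


def find_user(name, users):
    """Hypothetical lookup that returns None when nothing matches."""
    return users.get(name)


result = find_user("nobody", {"luis": 1})
if result is None:   # identity is the right test for the None object
    print("no such user")
```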


If you are really interested in the details, you can find the implementation of CPython’s strings here. But again: This is implementation detail, so you should never require this to work.

poke
  • 1
    Anyone reading this using python 2.x should also be aware that they should not use `a is False`, as booleans aren't singletons. – sapi May 26 '13 at 08:28
  • 1
    @sapi not even python3 should do that – jamylak May 26 '13 at 09:00
  • 3
    `a is False` makes no sense. The correct spelling is `not a`. – Lennart Regebro May 26 '13 at 10:10
  • 5
    `a is False` is beautiful english. Apparently it is too good to be True :) – Elazar May 26 '13 at 10:12
  • @LennartRegebro - you might want to compare two boolean expressions for equality though (in which case, `==` is correct). – detly May 26 '13 at 10:31
  • @sapi: how about assuming Python 2.5 or later? My belief is that assuming this minimum version they are "singletons" (not the right word, but it'll do); is that really not so? Can you give an example? – Chris Morgan May 26 '13 at 13:17
  • @ChrisMorgan - type `False = True` into the shell; that's why comparing to `True` or `False` at all doesn't make any sense. – sapi May 26 '13 at 13:23
  • Ah yes, I suppose so---but that doesn't, of itself, mean `is True` or `is False` is a bad idea. If anyone does things like assigning to the names `False` or `True` they should *expect* things to break. If identity comparison with the names `True` and `False` is not sound, then assignation of the values `True` and `False` is similarly not sound. – Chris Morgan May 26 '13 at 14:02
  • @ChrisMorgan: You should think more about it. `a is False` *is* a bad idea indeed, independently of whether `False` and `True` are variables, constants, or keywords. From the linguistic/logical point of view, think about expressions like `if not ambiguous is False then...` (`ambiguous` is a poor identifier on its own). `a is False` expresses a simple truth in a rather complicated way, and that is not what should be done when programming. – pepr May 26 '13 at 20:12
  • 1
    @pepr: sure, I know that `is True` and `is False` will almost never be the right way of doing it, but my objection is to @sapi's original way of putting it, that the reason it shouldn't be used is that "booleans aren't singletons". – Chris Morgan May 26 '13 at 22:51
  • Yes. +1. Anyway, we should always go to the core of the problem. – pepr May 27 '13 at 08:41
4

The is operator compares object identity, which is what the id function reports: an id is guaranteed to be unique among simultaneously existing objects. In CPython specifically, id returns the object's memory address. It seems that CPython has consistent memory addresses for strings containing only the characters a-z and A-Z.

However, this seems to only be the case when the string has been assigned to a variable:

Here, the id of "foo" and the id of a are the same. a has been set to "foo" prior to checking the id.

>>> a = "foo"
>>> id(a)
4322269384
>>> id("foo")
4322269384

However, the id of "bar" and the id of a are different when checking the id of "bar" prior to setting a equal to "bar".

>>> id("bar")
4322269224
>>> a = "bar"
>>> id(a)
4322268984

Checking the id of "bar" again after setting a equal to "bar" returns the same id.

>>> id("bar")
4322268984

So it seems that CPython keeps consistent memory addresses for strings containing only a-zA-Z when those strings are assigned to a variable. It's also entirely possible that this is version dependent: I'm running Python 2.7.3 on a MacBook. Others might get entirely different results.

Nolen Royalty
1

In fact, your code amounts to comparing the objects' ids (i.e. their memory addresses in CPython). So instead of your is comparison:

>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False

You can do:

>>> id(a) == id(b)
False

But note that if the two string literals appear directly in the comparison, the result is different:

>>> id('is it the space?') == id('is it the space?')
True

In fact, within a single expression identical static strings are shared. At the scale of the whole program, however, only word-like strings are shared (so nothing containing spaces or punctuation).

You should not rely on this behavior, as it is not documented anywhere and is an implementation detail.
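
The sharing inside one compiled unit can be observed directly. This is a sketch of CPython behavior only, not something the language guarantees, and same_block is a hypothetical name. Both literals are compiled into the same code object, where equal constants are stored once.

```python
def same_block():
    # Two identical literals with a space and punctuation, in one function body.
    a = 'is it the space?'
    b = 'is it the space?'
    return a is b


# The literal appears only once among the function's constants,
# so both names end up bound to that single object.
print(same_block.__code__.co_consts)
print(same_block())
```

Typing the two assignments as separate commands in the interactive interpreter compiles them separately, which is why the original a is b test behaves differently there.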

deufeufeu
0

Two or more identical strings consisting only of alphanumeric characters are stored in one structure, so they share a memory reference. There are posts about this phenomenon all over the internet since the 1990s. It has evidently always been that way. I have never seen a reasonable guess as to why that's the case. I only know that it is. Furthermore, if you split and re-join alphanumeric strings to remove spaces between words, the resulting identical alphanumeric strings do NOT share a reference, which I find odd. See below:

Add any non-alphanumeric value identically to both strings, and they instantly become copies, but not shared references.

a ="abbacca";  b = "abbacca";  a is b => True
a ="abbacca "; b = "abbacca "; a is b => False
a ="abbacca?"; b = "abbacca?"; a is b => False
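
The split-and-rejoin case can be reproduced as follows. This is a sketch under the assumption of CPython with Python 3: strings produced at run time are new objects, and sys.intern (a standard-library function) is shown only as the explicit way to ask for sharing, not as something the examples above call.

```python
import sys

a = "".join("ab ba cca".split())   # "abbacca", built at run time
b = "".join("ab ba cca".split())

print(a == b)                          # same characters
print(a is b)                          # separate objects in CPython
print(sys.intern(a) is sys.intern(b))  # interning hands back one shared object
```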

~Dr. C.

Dr. C.
-1

The is operator compares the actual objects.

c is d should also be false. My guess is that Python makes some optimization, and in that case they are the same object.

creack
  • CPython keeps a pool of objects that are used frequently - short string literals and primitives like ints in the range 1-100. There is no reason to assume `c is d` should be false. – Elazar May 26 '13 at 06:24
  • 1
    But on the other hand, one should *not* assume that `c is d` is true. – poke May 26 '13 at 06:26
  • @Elazar my point exactly. The pool is an optimization and strings with spaces are simply out of its scope. @poke is right, one should not assume that `c is d`. – creack May 26 '13 at 06:27
  • @jamylak I meant "at least", I did not know the exact range... thanks. – Elazar May 26 '13 at 06:29
  • 2
    @creack obviously the OP is asking about these implementation details exactly. I assume he knows not to use them. – Elazar May 26 '13 at 06:30