Python string identity: `is` and `in` statements

Question

I had some problems getting this to work:

# Shortened for brevity
def _coerce_truth(word):
    TRUE_VALUES = ('true','1','yes')
    FALSE_VALUES = ('false','0','no')

    _word = word.lower().strip()
    print "t" in _word
    if _word in TRUE_VALUES:
        return True
    elif _word in FALSE_VALUES:
        return False

I discovered:

In [20]: "foo" is "Foo".lower()
Out[20]: False

In [21]: "foo" is "foo".lower()
Out[21]: False

In [22]: "foo" is "foo"
Out[22]: True

In [23]: "foo" is "foo".lower()
Out[23]: False

Why is this? I understand that identity is different than equality, but when is identity formed? Statement 22 should be False unless, due to the static nature of strings, identity coincides with equality. In that case I'm confused by statement 23.

Please explain and thanks in advance.

This is about optimisation - as strings are immutable Python will automatically share instances of strings in your source code - they will be the same object. There is no guarantee of this, however. It should definitely not be relied upon. — Gareth Latty, Sep 19 '13 at 20:45
Can you clarify how `is` and `in` are behaving relative to each other, and what you expected? — Josh Lee, Sep 19 '13 at 20:49
It's very rare that you want to use `is` in Python. Use `==` instead. — Russell Borogove, Sep 20 '13 at 05:02

score 7 · Answer 1 · edited Sep 20 '13 at 00:03

Q. "When is identity formed?"

A. When the object is created.
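A minimal sketch of that point, written for Python 3 (an assumption, since the question uses Python 2 syntax). The names `a` and `b` are purely illustrative. Whether two equal strings built at runtime end up as one object is up to the implementation, so the sketch prints the result rather than promising one:

```python
# Build two equal strings at runtime instead of writing the same literal twice.
a = "".join(["f", "o", "o"])
b = "".join(["f", "o", "o"])

same_value = (a == b)      # compares contents
same_object = (a is b)     # compares identity; result is implementation dependent

# For two objects that are both alive, `is` agrees with comparing id() values.
ids_match = (id(a) == id(b))

print(same_value, same_object, ids_match)
```

Each `join` call creates (or reuses) an object at the moment it runs, and that is when the identity is fixed.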

What you're seeing is actually an implementation detail of CPython: it caches small strings and reuses them for efficiency. Other interesting cases are:

"foo" is "foo".strip()  # True
"foo" is "foo"[:]       # True

Ultimately, what we see is that the string literal "foo" has been cached. Every time you type "foo", you're referencing the same object in memory. However, some string methods will choose to always create new objects (like .lower()) and some will smartly re-use the input string if the method made no changes (like .strip()).
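One way to check this on your own interpreter is sketched below, assuming Python 3. The helper name `reuses_input` is hypothetical, and the results are deliberately not listed because they can differ between versions and implementations:

```python
def reuses_input(method_name, s):
    """Return True if the named no-argument str method hands back s itself."""
    result = getattr(s, method_name)()
    # The check only makes sense when the method leaves the value unchanged.
    assert result == s
    return result is s

word = "foo"
report = {name: reuses_input(name, word)
          for name in ("lower", "strip", "lstrip", "rstrip")}
print(report)
```

Whatever the report shows, `==` gives the same answer for every entry, which is why it is the safer comparison.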

One benefit of this is that string equality can be implemented by a pointer compare (blazingly fast) followed by a character-by-character comparison if the pointer comparison is false. If the pointer comparison is True, then the character-by-character comparison can be avoided.
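The idea can be sketched in pure Python as follows. This is an illustration only, not CPython's actual C code; the function name `str_equal` and the length check are additions for the example:

```python
def str_equal(a, b):
    """Equality with an identity shortcut."""
    if a is b:                # same object: equal by definition, skip the scan
        return True
    if len(a) != len(b):      # different lengths can never be equal
        return False
    # Fall back to comparing character by character.
    return all(x == y for x, y in zip(a, b))

runtime_foo = "".join(["f", "o", "o"])
print(str_equal("foo", runtime_foo))
print(str_equal("foo", "fob"))
```

The shortcut only saves time; the result is the same whether or not the two arguments happen to be the same object.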

edited Sep 20 '13 at 00:03

dawg

answered Sep 19 '13 at 20:45

mgilson

I’m seeing `'foo' is 'foo'.lower()` is `True` in Python 3. – Josh Lee Sep 19 '13 at 20:48
Probably because `lower` is implemented as an in-place list operation in Python3. It's not in Python2. – Shashank Sep 19 '13 at 20:51
@JoshLee -- Well, it looks like implementation details hold true to their name then :). The take-away is that if it's implementation dependent, then you shouldn't rely on that behavior since it's likely to be different for you compared to me. – mgilson Sep 19 '13 at 20:53

score 3 · Answer 2 · answered Sep 19 '13 at 21:02

As for the relation between `is` and `in`:

The `__contains__` method (which backs the `in` operator) for tuple and list first checks identity while looking for a match and, if that fails, checks equality. This gives you sane results even with objects that don't compare equal to themselves:

>>> x = float("NaN")
>>> t = (1, 2, x)
>>> x in (t)
True
>>> any(x == e for e in t) # this might be surprising
False
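A rough Python equivalent of that lookup, assuming Python 3, is sketched here. The function name `contains` is hypothetical, and `TRUE_VALUES` is borrowed from the question to show why the original `in` test works even though `.lower()` may return a new object:

```python
def contains(container, x):
    """Approximate `x in container` for lists and tuples: identity, then equality."""
    return any(x is e or x == e for e in container)

nan = float("nan")
t = (1, 2, nan)
print(contains(t, nan))   # found through the identity check, although nan != nan

TRUE_VALUES = ("true", "1", "yes")
word = " TRUE ".strip().lower()
print(contains(TRUE_VALUES, word))   # found through the equality fallback
```

So membership tests on strings never depend on interning: if identity fails, equality still matches.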

answered Sep 19 '13 at 21:02

lqc

The properties of `nan` not being equal to itself is irrelevant to the question. `float('nan')==float('nan')` is always False – Sep 19 '13 at 23:25
@Pyson: NaN is only an example of why `in` needs to check for identity instead of just checking for equality. – lqc Sep 20 '13 at 07:50
Most specifically, it *first* checks for identity, *then* equality if identity turns out `False`. – mgilson Sep 20 '13 at 16:05

dawg · Answer 3 · 2013-09-19T23:36:16.870

Suppose you have a bunch of ways of creating objects that have the string value of 'foo':

def f(s): return s.lower()

li=['foo',                     # 0
    'foo'.lower(),             # 1
    'foo'.strip(),             # 2
    'foo',                     # 3
    'f'+'o'*2,                 # 4
    '{}'.format('foo'),        # 5
    'f'+'o'+'o',               # 6
    intern('foo'.lower()),     # 7
    'foo'.upper().lower(),     # 8
    f('foo'),                  # 9
    'FOO'.lower(),             # 10
    'foo',                     # 11
    format('foo'),             # 12
    '%s'%'foo',                # 13
    '%s%s'%('fo','o'),         # 14
    'f'   'oo'                 # 15
]

`is` will only evaluate to True if the two objects have the same id. You can test this and see which string objects have ended up as the same immutable string object at runtime:

def cat(l):
    # Group the list indexes by the id() of the object stored at each index.
    d={}
    for i,e in enumerate(l):
        k=id(e)
        d.setdefault(k,[]).append(i)

    return '\n'.join(str((k,d[k])) for k in sorted(d))

print cat(li)

Prints:

(4299781024, [0, 2, 3, 6, 7, 11, 12, 13, 15])
(4299781184, [5])
(4299781784, [1])
(4299781864, [4])
(4299781904, [9])
(4299782944, [8])
(4299783064, [10])
(4299783144, [14])

You can see that most resolve (or are interned) to the same string object, but some do not. Which ones do is implementation dependent.

You can make them be all the same string object by using the function intern:

print cat([intern(s) for s in li])

Prints:

(4299781024, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
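In Python 3 the `intern` builtin moved to `sys.intern`. A shortened version of the same experiment, assuming Python 3, might look like the sketch below. The name `group_by_identity` and the `samples` list are illustrative, and the number of groups before interning is not stated because it depends on the interpreter:

```python
import sys

def group_by_identity(strings):
    """Map id() -> list of indexes whose entries are that same object."""
    groups = {}
    for index, s in enumerate(strings):
        groups.setdefault(id(s), []).append(index)
    return groups

samples = ["foo", "FOO".lower(), "{}".format("foo"), "".join(["f", "oo"])]
before = group_by_identity(samples)
after = group_by_identity([sys.intern(s) for s in samples])

print(len(before), len(after))
```

After interning, every equal string maps to one canonical object, so `after` contains a single group.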

I'm not sure how this helps. If you change it to `'foo'.strip()`, you get pretty much the same `dis` output, but different results. Is your point the `LOAD_CONST` part? — mgilson, Sep 19 '13 at 20:55
