string identity comparison in CPython

Question

I recently discovered a potential bug in a production system where two strings were compared using the identity operator, e.g.:

if val[2] is not 's':

I imagine this will often work anyway because, as far as I know, CPython stores short immutable strings in the same location. I've replaced it with !=, but I need to confirm that the data that previously went through this code is correct, so I'd like to know whether this always worked or only sometimes worked.

The Python version has always been 2.6.6 as far as I know, and the above code seems to be the only place where the is operator was used.

Does anyone know if this line will always work as the programmer intended?

edit: Because this is no doubt very specific and unhelpful to future readers, I'll ask a different question:

Where should I look to confirm with absolute certainty the behaviour of the Python implementation? Are the optimisations in CPython's source code easy to digest? Any tips?

See http://stackoverflow.com/questions/2858603/python-why-is-keyword-has-different-behavior-when-there-is-dot-in-the-string/2858680#2858680 for Alex Martelli's dive into implementation details. The short answer is of course "No, there is no guarantee that this is consistent". From a practical standpoint, though, automatic interning is likely to work as you expect for such a short string literal. However, I have some doubts about strings constructed in C extensions; those might not be subject to the same automatic interning heuristics. – Jeremy Brown, Jan 31 '11 at 17:18
Since even `chr(115) is 's'` works, for which looking up the string is more of a performance hit than anything, I'd guess the lookup happens automatically below a certain length. — Jochen Ritzel, Jan 31 '11 at 17:33
@David This code was used when processing data to be saved in a database. If it is sometimes incorrect, then some of the saved data is also incorrect. There are a few million values in the database and it would be impossible to tell which ones were saved correctly and which ones were not. — Will Hardy, Jan 31 '11 at 17:44
@Jeremy, @Jochen thanks, that's very helpful. I think I have to read through the CPython source for my particular version and see if the optimisation is always there for single characters. — Will Hardy, Jan 31 '11 at 17:49

score 3 · Answer 1 · answered Jan 31 '11 at 16:53

You are certainly not supposed to use the is/is not operator when you want to compare the values of two objects rather than check whether they are the same object.

While it might make sense for an implementation to reuse an existing string object instead of creating a new one with the same contents (since strings are immutable), which would make equality and identity equivalent, I wouldn't rely on that, especially with the many Python implementations out there.
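
To illustrate the difference, here is a minimal sketch (the helper name `build` is made up for this example). Two strings assembled at run time compare equal, but whether they end up as the same object is left to the implementation, so the second result should not be relied on:

```python
def build(parts):
    """Assemble a string at run time so it is not a source-code literal."""
    return ''.join(parts)


a = build(['he', 'llo'])
b = build(['hel', 'lo'])

print(a == b)   # value comparison: same characters, so this is True
print(a is b)   # identity comparison: an implementation detail, may differ between interpreters
```

The single-argument `print(...)` form is used so the snippet runs on both Python 2 and Python 3.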

ThiefMaster

I didn't use it, and we certainly won't be using it anymore. But I need to know if this worked anyway or not. If not, then the data are wrong and we need to do something about that and this will be really difficult. – Will Hardy Jan 31 '11 at 16:59
I think in CPython it works because it re-uses string objects. No idea about Jython etc. – ThiefMaster Jan 31 '11 at 17:02
That's what I thought, can anyone confirm this with certainty? – Will Hardy Jan 31 '11 at 17:07

score 3 · Accepted Answer · answered Jan 31 '11 at 17:56

You can look at the CPython code for 2.6.x: http://svn.python.org/projects/python/branches/release26-maint/Objects/stringobject.c

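It looks like one-character strings are treated specially: when one is created through this function from a non-NULL buffer and a cached object already exists, that object is reused, so your code should be safe for strings created this way. Here's some key code (excerpted):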

static PyStringObject *characters[UCHAR_MAX + 1];

PyObject *
PyString_FromStringAndSize(const char *str, Py_ssize_t size)
{
    register PyStringObject *op;
    if (size == 1 && str != NULL &&
        (op = characters[*str & UCHAR_MAX]) != NULL)
    {
        Py_INCREF(op);
        return (PyObject *)op;
    }

...
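
One way to check this empirically on a given interpreter is a small probe like the one below. It is only a sketch: the function name `unshared_single_chars` is hypothetical, and an empty result shows what this interpreter did on this run, not something the language guarantees.

```python
def unshared_single_chars(limit=256):
    """Return the code points below `limit` for which two separately
    created one-character strings are equal but not the same object."""
    unshared = []
    for code in range(limit):
        a = chr(code)               # built directly from the code point
        b = ('x' + chr(code))[1]    # built by concatenating, then indexing
        if a == b and a is not b:
            unshared.append(code)
    return unshared


print(unshared_single_chars())
```

Running it under the exact interpreter build used in production is the point; results from another version or implementation say little about the original system.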

Thanks for that! I was wading through the CPython source code, but wouldn't have been able to recognise where this was being handled. Cheers! — Will Hardy, Jan 31 '11 at 19:50

Answer 3 · answered Jan 31 '11 at 18:13 · Joe Kington

As people have already noted, it should always hold for single-character strings created in Python (or CPython, anyway), but if you're using a C extension, it won't necessarily be the case.

As a quick counter-example:

import numpy as np

x = 's'
y = np.array(['s'], dtype='|S1')

print x
print y[0]

print 'x is y[0] -->', x is y[0]
print 'x == y[0] -->', x == y[0]

This yields:

s
s
x is y[0] --> False
x == y[0] --> True

Of course, if nothing ever used a C extension of any sort, you're probably safe... I wouldn't count on it, though...

Edit: As an even simpler example, it doesn't hold if things have been pickled or packed with struct in any way.

e.g.:

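import pickle
x = 's'
# Round-trip the string through a pickle file, closing the file each time
with open('test', 'wb') as f:
    pickle.dump(x, f)
with open('test', 'rb') as f:
    y = pickle.load(f)

print x is y
print x == y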

Also (using a different letter for clarity, as we need "s" for the formatting string):

import struct
x = 'a'
y = struct.pack('s', x)

print x is y
print x == y
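
If sample inputs from the old pipeline are still available, a rough way to see whether the two operators would ever have disagreed is to replay both comparisons side by side. This is a sketch under assumptions: `find_disagreements` and `TaggedStr` are hypothetical names, and `TaggedStr` merely stands in for a string that arrives as a separate object, as in the examples above.

```python
def find_disagreements(values, reference):
    """Return the values for which `is not` and `!=` give different answers
    when compared against `reference`."""
    return [v for v in values
            if (v is not reference) != (v != reference)]


class TaggedStr(str):
    """A str subclass: equal to the plain string, but always a distinct object."""


reference = 's'
samples = [reference, 'abc', TaggedStr('s')]
print(find_disagreements(samples, reference))
```

Values that are not equal to the reference can never disagree, so only inputs equal to `'s'` need to be checked.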

Thanks for that, I'm only dealing with strings so all good here, but I'm sure many others will come across this useful information. — Will Hardy, Jan 31 '11 at 19:49

score 2 · Answer 4 · answered Jan 31 '11 at 17:51

This behavior will always apply for empty and single-character Latin-1 strings. From unicodeobject.c:

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u,
                                Py_ssize_t size)
{
.....
    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && *u < 256) {
        unicode = unicode_latin1[*u];

This snippet is from Python 3, but it's likely that a similar optimization exists in earlier versions.

Wonderful, thanks! Looking through Python2.6 source code now :-) — Will Hardy, Jan 31 '11 at 19:40

score 0 · Answer 5 · answered Jun 28 '13 at 12:25

Granted, it works because of automatic interning of short strings (the same mechanism that applies to constants in Python source, such as the literal 's'), but it's quite silly to use identity here.

Python is about duck typing: any object that looks like a string could be used. For example, the same code fails if val[2] is actually u"s".
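
A minimal sketch of that case (the variable names are made up for illustration). On Python 2 the two objects have different types, so they cannot be the same object and an identity check necessarily reports a mismatch, while the value comparison still succeeds:

```python
plain = 's'
text = u's'   # unicode on Python 2; an ordinary str on Python 3.3+

print(text == plain)               # value comparison
print(type(text) is type(plain))   # whether the two share a type on this interpreter
```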
