18

I've noticed that adding a space to identical strings makes them compare unequal using is, while the non-space versions compare equal.

a = 'abc'
b = 'abc'
a is b
#outputs: True

a = 'abc abc'
b = 'abc abc'
a is b
#outputs: False

I have read this question about comparing strings with == and is. I think this is a different question because the space character is changing the behavior, not the length of the string. See:

a = 'abc'
b = 'abc'
a is b # True

a = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
b = 'gfhfghssrtjyhgjdagtaerjkdhhgffdhfdah'
a is b # True

Why does adding a space to the string change the result of this comparison?

midkin
  • 1
    This question is worded differently, but the answers to the other question work here, as the key is recognizing what `is` does and how it differs from `==`. – DSM Feb 04 '15 at 19:15
  • 1
    There was a comment that got deleted referring you to the difference between "is" and "==". Is that your issue, or are you curious as to why the strings have the same identity if there is no space, but have different identities when there is a space character... Other non-printable characters also make the identities different. – hft Feb 04 '15 at 19:18
  • 2
    I did NOT ask anything about == or the difference between == and is, sorry – midkin Feb 04 '15 at 19:21
  • 1
    A good article on the topic: http://guilload.com/python-string-interning/. – georg Feb 04 '15 at 21:34

1 Answer

32

The CPython interpreter caches (interns) some strings based on certain criteria. The first string, abc, is cached and reused for both names, but the second is not. Small ints from -5 to 256 are cached in a similar way.

Because the string is interned, assigning "abc" to both a and b makes a and b point to the same object in memory, so is, which checks whether two names refer to the very same object, returns True.

The second string, abc abc, is not cached, so a and b refer to two entirely different objects in memory and our identity check using is returns False. This time a is not b.

In [43]: a = "abc" # python caches abc
In [44]: b = "abc" # it reuses the object when assigning to b
In [45]: id(a)
Out[45]: 139806825858808    # same id's, same object in memory
In [46]: id(b)
Out[46]: 139806825858808    
In [47]: a = 'abc abc'   # not cached  
In [48]: id(a)
Out[48]: 139806688800984    
In [49]: b = 'abc abc'    
In [50]: id(b)         # different id's different objects
Out[50]: 139806688801208
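The difference between equality and identity can also be checked outside the REPL. The following is a minimal sketch, assuming Python 3 (where the intern built-in lives in sys.intern). The helper name build is hypothetical; it builds the string at run time so the compiler cannot merge the two values into one constant.

```python
import sys

def build():
    # Build the string at run time so each call produces its own result.
    return " ".join(["abc", "abc"])

a = build()
b = build()

# == compares contents, so this is True regardless of caching.
equal_values = (a == b)

# Whether a and b are the same object is an implementation detail;
# on CPython they are typically two distinct objects.
same_object = (a is b)

# sys.intern returns the canonical copy, so equal strings map to one object.
ia = sys.intern(a)
ib = sys.intern(b)
interned_same = (ia is ib)

print(equal_values, same_object, interned_same)
```

Only the equality check and the identity of the two sys.intern results are guaranteed; the middle value depends on the interpreter.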

The criterion for caching a string is that it contains only letters, digits and underscores, so in your case the space means the string does not qualify.

In the interactive interpreter there is one case where both names can end up pointing to the same object even when the string does not meet this criterion: multiple assignment.

In [51]: a,b  = 'abc abc','abc abc'

In [52]: id(a)
Out[52]: 139806688801768

In [53]: id(b)
Out[53]: 139806688801768

In [54]: a is b
Out[54]: True
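The same sharing appears whenever both literals are compiled together as one unit. The sketch below assumes CPython, whose compiler stores equal constants once per code object; the source text and the "<demo>" filename are made up for illustration.

```python
# Two assignments compiled together as a single code object.
source = "a = 'abc abc'\nb = 'abc abc'\n"
code = compile(source, "<demo>", "exec")

# Equal constants inside one code object are stored once in co_consts.
occurrences = code.co_consts.count("abc abc")

# Run the compiled code and compare the two resulting objects.
namespace = {}
exec(code, namespace)
same_object = namespace["a"] is namespace["b"]

print(occurrences, same_object)
```

Both names are loaded from the same constant slot here, which is why the identity check succeeds even though the string contains a space.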

Looking at the codeobject.c source for the criteria, we see that NAME_CHARS decides what can be interned:

#define NAME_CHARS \
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

/* all_name_chars(s): true iff all chars in s are valid NAME_CHARS */

static int
all_name_chars(unsigned char *s)
{
    static char ok_name_char[256];
    static unsigned char *name_chars = (unsigned char *)NAME_CHARS;

    if (ok_name_char[*name_chars] == 0) {
        unsigned char *p;
        for (p = name_chars; *p; p++)
            ok_name_char[*p] = 1;
    }
    while (*s) {
        if (ok_name_char[*s++] == 0)
            return 0;
    }
    return 1;
}
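For readers less familiar with C, the check above can be sketched in Python. This is only an illustrative equivalent of the character test, not code the interpreter runs.

```python
import string

# Same character set as NAME_CHARS in the C source: digits, letters, underscore.
NAME_CHARS = frozenset(string.digits + string.ascii_letters + "_")

def all_name_chars(s):
    """Return True if every character of s is a letter, digit or underscore."""
    return all(ch in NAME_CHARS for ch in s)

print(all_name_chars("abc"))      # True
print(all_name_chars("abc abc"))  # False, the space is not in NAME_CHARS
```

Note that length plays no part in the test, which matches the observation in the question that the long string without a space still compares identical.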

A string of length 0 or 1 will always be shared, as we can see in the PyString_FromStringAndSize function in the stringobject.c source.

/* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1 && str != NULL) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}
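The effect of sharing short strings can be observed from Python. This is a hedged sketch, assuming a current CPython 3 build, where the equivalent sharing applies to empty strings and to one-character strings in the Latin-1 range; other implementations may behave differently.

```python
import sys

word = "abc"

first = word[0]        # one-character string produced by indexing
second = chr(97)       # one-character string produced from a code point
single_shared = first is second

empty_one = word[:0]   # empty string produced by slicing
empty_two = str()      # empty string produced by the constructor
empty_shared = empty_one is empty_two

print(sys.implementation.name, single_shared, empty_shared)
```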

Not directly related to the question, but for those interested: PyCode_New, also from the codeobject.c source, shows how more strings are interned when building a code object, provided they meet the criteria in all_name_chars.

PyCodeObject *
PyCode_New(int argcount, int nlocals, int stacksize, int flags,
       PyObject *code, PyObject *consts, PyObject *names,
       PyObject *varnames, PyObject *freevars, PyObject *cellvars,
       PyObject *filename, PyObject *name, int firstlineno,
       PyObject *lnotab)
{
    PyCodeObject *co;
    Py_ssize_t i;
    /* Check argument types */
    if (argcount < 0 || nlocals < 0 ||
        code == NULL ||
        consts == NULL || !PyTuple_Check(consts) ||
        names == NULL || !PyTuple_Check(names) ||
        varnames == NULL || !PyTuple_Check(varnames) ||
        freevars == NULL || !PyTuple_Check(freevars) ||
        cellvars == NULL || !PyTuple_Check(cellvars) ||
        name == NULL || !PyString_Check(name) ||
        filename == NULL || !PyString_Check(filename) ||
        lnotab == NULL || !PyString_Check(lnotab) ||
        !PyObject_CheckReadBuffer(code)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    intern_strings(names);
    intern_strings(varnames);
    intern_strings(freevars);
    intern_strings(cellvars);
    /* Intern selected string constants */
    for (i = PyTuple_Size(consts); --i >= 0; ) {
        PyObject *v = PyTuple_GetItem(consts, i);
        if (!PyString_Check(v))
            continue;
        if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
            continue;
        PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
    }
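The interning of selected constants can be probed by compiling the same literal twice and comparing the resulting constants. This is a rough sketch assuming CPython; the helper names constant_from_fresh_compile and shared_across_compiles are hypothetical, and the result for strings that fail the name-character test may vary between versions and builds.

```python
import sys

def constant_from_fresh_compile(text):
    # Compile a one-line assignment and return the string constant it stores.
    code = compile(f"x = {text!r}", "<demo>", "exec")
    return next(c for c in code.co_consts if c == text)

def shared_across_compiles(text):
    # True when two independent compilations yield the very same object.
    return constant_from_fresh_compile(text) is constant_from_fresh_compile(text)

print(sys.implementation.name)
for candidate in ("abc", "abc abc"):
    print(repr(candidate), shared_across_compiles(candidate))
```

A name-like constant such as 'abc' is expected to be the same object in both code objects on CPython, because each compilation interns it.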

This answer is based on simple assignments using the CPython interpreter. Interning in relation to functions, or any other functionality outside of simple assignments, was not asked about and is not covered here.

If anyone with a greater understanding of C code has anything to add, feel free to edit.

There is a much more thorough explanation here of the whole string interning.

Padraic Cunningham
  • 1
    You're right about the numbers, as a=260, b=260 ---> a is b is False... However, what's the answer about my strings? – midkin Feb 04 '15 at 19:24
  • 1
    No, since a='rgaerdavaerdgvrdversdvseDvaerdvreds', b='rgaerdavaerdgvrdversdvseDvaerdvreds' gives True. It's about the SPACE char and not about long or short – midkin Feb 04 '15 at 19:26
  • 2
    Note: the dup question **does** contain the information you are giving here, in the [second answer](http://stackoverflow.com/a/1504848/510937). – Bakuriu Feb 04 '15 at 19:34
  • 1
    So a string with SPACE doesn't get cached! Am I right? – midkin Feb 04 '15 at 19:35
  • 2
    @midkin It's *useless* to think about when that happens because there are *no guarantees*. We **cannot** say that if there's a space it doesn't get cached because it *may* happen now, and not in the next bugfix release. Different python implementations will use different ways of determining when to intern, and different versions of the same implementation are free to disagree in this regard. – Bakuriu Feb 04 '15 at 19:36
  • 1
    @Bakuriu I accept that your answer is right! It just seems that in Python 3.4.2's IDLE, strings with a SPACE are not getting cached! – midkin Feb 04 '15 at 19:41
  • 2
    I decided to downvote your answer because you are providing *false information*. If you want to add a paragraph as your last one you should either 1) provide a reference to some documentation that mentions that or 2) provide a link to the source code where that is done *and* clearly mention that this is bound to change with implementations and versions. As it stands your answer is suggesting that such a rule exists and can be relied upon, **which is false** (at least, as far as I know. If I'm wrong prove it with a reference). – Bakuriu Feb 04 '15 at 19:42
  • 1
    @PadraicCunningham since I think your answer is "answering" most of what I needed to hear, I'll choose it as my answer! Btw thanks all for your comments! – midkin Feb 04 '15 at 19:42
  • 3
    @PadraicCunningham That's not a reference from the documentation or any official document about the python language. Also: note that right after mentioning interning it mentions peephole optimizations which is exactly one point where what you say fails: put `a = 'abc abc'` and `b = 'abc abc'` inside a function and suddenly `a is b` returns `True`. – Bakuriu Feb 04 '15 at 19:47
  • 4
    @Bakuriu, multiple assignments are not single assignments so there is a distinct difference, a function is not mentioned anywhere, what is asked in the question and demonstrated in the answer is simple assignments, there is no mention of functions anywhere just an explanation for the question asked. – Padraic Cunningham Feb 04 '15 at 19:53
  • 1
    Multiple assignments are not the issue per se, the thing is that the REPL sees both strings _on the same line_, thus making a peephole optimization possible. For example `a='x y'; b='x y'` in REPL will cause `x y` to be folded and interned as well. – georg Feb 04 '15 at 21:38
  • 1
    @georgm yes, what I was trying to say was that in the purest form, when you create a simple single assignment one per line, the string will be interned based on the criteria. The peephole optimisation is another part where there are certain limits on length etc., but that is another topic. If you know of anything related that is not covered feel free to edit – Padraic Cunningham Feb 04 '15 at 21:52
  • 2
    @martijnpieters , I referenced your article in codementor in a comment before i edited after looking at the source, in your article it states that the string must start with an underscore or a letter but it seems that it can also start with a number, the only criteria is that it must be one of the three no? – Padraic Cunningham Feb 04 '15 at 22:23
  • 1
    Shouldn't the second comment in the first code block be 'it reuses the object when assigning to **b**'? – BioGeek Aug 21 '15 at 13:37
  • @BioGeek, yep, fixed. – Padraic Cunningham Aug 21 '15 at 14:01