10

I have searched the web and stack overflow questions but been unable to find an answer to this question. The observation that I've made is that in Python 2.7.3, if you assign two variables the same single character string, e.g.

>>> a = 'a'
>>> b = 'a'
>>> c = ' '
>>> d = ' '

Then the variables will share the same reference:

>>> a is b
True
>>> c is d
True

This is also true for some longer strings:

>>> a = 'abc'
>>> b = 'abc'
>>> a is b
True
>>> '  ' is '  '
True
>>> ' ' * 1 is ' ' * 1
True

However, there are a lot of cases where the reference is (unexpectantly) not shared:

>>> a = 'a c'
>>> b = 'a c'
>>> a is b
False
>>> c = '  '
>>> d = '  '
>>> c is d
False
>>> ' ' * 2 is ' ' * 2
False

Can someone please explain the reason for this?

I suspect there might be simplifications/substitutions made by the interpreter and/or some caching mechanism that makes use of the fact that python strings are immutable to optimize in some special cases, but what do I know? I tried making deep copies of strings using the str constructor and the copy.deepcopy function but the strings still inconsistently share references.

The reason I'm having problems with this is because I check for inequality of references to strings in some unit tests I'm writing for clone methods of new-style python classes.

EriF89
  • 730
  • 7
  • 12

3 Answers3

9

The details of when strings are cached and reused are implementation-dependent, can change from Python version to Python version and cannot be relied upon. If you want to check strings for equality, use ==, not is.

In CPython (the most commonly-used Python implementation), string literals that occur in the source code are always interned, so if the same string literal occurs twice in the source code, they will end up pointing to the same string object. In Python 2.x, you can also call the built-in function intern() to force that a particular string is interned, but you actually shouldn't do so.

Edit regarding you actual aim of checking whether attributes are improperly shared between instances: This kind of check is only useful for mutable objects. For attributes of immutable type, there is no semantic difference between shared and unshared objects. You could exclude immutable types from your tests by using

Immutable = basestring, tuple, numbers.Number, frozenset
# ...
if not isinstance(x, Immutable):    # Exclude types known to be immutable

Note that this would also exclude tuples that contain mutable objects. If you wanted to test those, you would need to recursively descend into tuples.

Cristian Ciupitu
  • 20,270
  • 7
  • 50
  • 76
Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • Thank you for your answer, it seems that you are correct regarding that string literals in source code are always interned (strings which are not literals do not have this property). Hence, my examples are only valid when writing in the python interactive interpreter. I am using CPython and your answer confirms some of my suspicions. I agree that the == operator should be used to check strings for equality. Is there any elegant way of checking where an attribute was initialized or should I just avoid this kind of testing? – EriF89 Jul 23 '12 at 11:59
  • @EriF89: What does it mean to check "where an attribute was initialized"? And how does it matter? – Sven Marnach Jul 23 '12 at 12:09
  • Since I'm writing tests to make sure that attribute references are not improperly shared between cloned objects, I would like a generic way of checking that changes in one of the objects does not leak through to the other. I thought the is operator would do this but since the references are shared, I guess I have to assign new values to attributes of one of the objects, and check that they didn't change in the other, instead. – EriF89 Jul 23 '12 at 12:21
  • 1
    @EriF89: If you think you need this kind of test, restrict it to mutable types. If the attributes are of some type known to be immutable (e.g. `basestring`, `tuple`, `numbers.Number`, `frozenset`), there is no point in checking whether items are shared or not. It does not matter. – Sven Marnach Jul 23 '12 at 12:52
5

I think it's an implementation and optimization thing. If the string are short, they can (and are often?) "shared", but you can't depend on that. Once you have longer strings, you can see that they are not the same.

In [2]: s1 = 'abc'
In [3]: s2 = 'abc'

In [4]: s1 is s2
Out[4]: True

longer strings

In [5]: s1 = 'abc this is much longer'
In [6]: s2 = 'abc this is much longer'

In [7]: s1 is s2
Out[7]: False

use == to compare strings (and not the is operator).

--

OP's observation/hypothesis (in the comments below) that this may be due to the number of tokens seems to be supported by the following:

In [12]: s1 = 'a b c'
In [13]: s2 = 'a b c'

In [14]: s1 is s2
Out[14]: False

if compared with the initial example of abc above.

Levon
  • 138,105
  • 33
  • 200
  • 191
  • To me, it seems like the number of tokens (sequences of alphanumerics separated by strings or other symbols) in the string matters: In [1] s1 = 'abcthisismuchlonger' In [2] s2 = 'abcthisismuchlonger' In [3] s1 is s2 Out [3] True If there is more than one token, the string is not shared. – EriF89 Jul 23 '12 at 11:40
  • @EriF89 It is possible, I have never really looked into the details of how this is implemented, I just try to remember to use `==` when comparing strings and not depend on the storage between "identical" strings being shared. Your hypothesis about how this is decided makes sense though, I just tried with `'a b c'` and it came out as `False`. – Levon Jul 23 '12 at 11:44
5

In CPython, as an implementation detail the empty string is shared, as are single-character strings whose codepoint is in the Latin-1 range. You should not depend on this, as it is possible to bypass this feature.

You can request a string to be interned using sys.intern; this will happen automatically in some cases:

Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

sys.intern is exposed so that you can use it (after profiling!) for performance:

Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.

Note that intern is a builtin in Python 2.

ecatmur
  • 152,476
  • 27
  • 293
  • 366
  • 1
    +1 for relevant source code link. Interesting, although the details on interning strings for optimizing dictionary lookups are not relevant to me at this moment. – EriF89 Jul 23 '12 at 12:08