
Can someone explain why, when I create equal unicode strings in Python 2.7, they do not point to the same location in memory, the way "normal" strings do?

>>> a1 = 'a'
>>> a2 = 'a'
>>> a1 is a2
True

OK, that was what I expected, but:

>>> ua1 = u'a'
>>> ua2 = u'a'
>>> ua1 is ua2
False

Why? How?

Zokis
  • See http://stackoverflow.com/questions/10622472/when-does-python-choose-to-intern-a-string for more info. But the short version is: normal strings _may_ be interned, but are not guaranteed to be. When they are is complicated, version-specific, and intentionally not documented. So you shouldn't rely on it for anything. – abarnert Mar 13 '13 at 18:56
  • Thanks. No, I won't rely on `is` (only `is None`); I was just curious about Python's internal implementation so I can compare it with Java. – Zokis Mar 13 '13 at 18:59
  • You can use `is` for any custom class of your own and for singletons (`None`, the empty tuple `()`, etc.). `int` and short string values may be interned though, use the equality test (`==`) for those. – Martijn Pieters Mar 13 '13 at 19:05
  • `is` is useful for other types besides `None`. For example, it's pretty common practice to create a `sentinel = object()` to use as a default parameter value, "end of queue" marker, etc., which you check with `is`. – abarnert Mar 13 '13 at 19:08
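
The sentinel idiom mentioned in the comments could look roughly like this. It is only a sketch: `_MISSING` and `get_setting` are hypothetical names made up for illustration, not part of any library.

```python
# Hypothetical sentinel: a unique object that no caller can pass by accident.
_MISSING = object()

def get_setting(settings, key, default=_MISSING):
    """Return settings[key]; fall back to default only if one was supplied."""
    value = settings.get(key, _MISSING)
    if value is _MISSING:            # identity check is correct for a sentinel
        if default is _MISSING:
            raise KeyError(key)
        return default
    return value                     # may legitimately be None

print(get_setting({"debug": None}, "debug"))    # a stored None is returned as-is
print(get_setting({}, "debug", default=False))  # missing key, default is used
```

Here `is` is the right test because the question really is "is this that exact object?", and `None` stays available as an ordinary value.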

2 Answers


I think regular strings are interned but unicode strings are not. This simple test seems to support my theory (Python 2.6.6):

>>> intern("string")
'string'
>>> intern(u"unicode string")

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    intern(u"unicode string")
TypeError: intern() argument 1 must be string, not unicode
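
For comparison, a rough Python 3 counterpart of the test above (this assumes Python 3, where the builtin moved to `sys.intern`; it is not part of the original 2.6.6 session). The roles are swapped: text strings are accepted and byte strings are rejected.

```python
import sys

# In Python 3, str is the unicode type and can be interned.
text = sys.intern("string")

# bytes (the closest analogue of the 2.x str type) is refused.
try:
    sys.intern(b"byte string")
    rejected = False
except TypeError:
    rejected = True

print(text, rejected)
```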
Claudiu
  • they are all immutable, but strings are cached to be used faster internally (like in class dicts), whereas unicode strings aren't, apparently. – Claudiu Mar 13 '13 at 18:52
  • Ah yes, thanks for the reply, but what about Python 3, which is entirely unicode? – Zokis Mar 13 '13 at 18:55
  • @MarceloTambalo: In Python 3.x, `str` (== `unicode`) may be interned. But, as with `str` (== `bytes`) in 2.x, they're not guaranteed to be. – abarnert Mar 13 '13 at 18:57
  • @MarceloTambalo: in Python 3, for `"hi" is "hi"`, I get `True`, so it seems like it does. There is no `u"hi"` syntax in Python 3, because it's all unicode by default. – Claudiu Mar 13 '13 at 18:59
  • @Claudiu: There _is_ a `u"hi"` in 3 (as of… either 3.3 or 3.2), which is identical to `"hi"`. Just as `b"hi"` is identical to `"hi"` in 2.7. – abarnert Mar 13 '13 at 18:59
  • @abarnert: oh my mistake. The online interpreter I tried rejected it; it must have been an earlier version. – Claudiu Mar 13 '13 at 19:11
  • u"hi" is back on 3.3 only for backward compatibility syntax with python 2.x series. (u"hi" fails on 3.0 3.1 & 3.2) – Toilal Feb 13 '14 at 06:10

Normal strings are not guaranteed to be interned. Sometimes they are, sometimes they aren't. The rules are complicated, version-specific, and intentionally not documented.

You can depend on the fact that Python tries to intern small-ish, commonly-used objects whenever it's a good idea. And that, if you write any code that depends on either a1 is a2 or the converse, it will break whenever it's most inconvenient.
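
A small sketch of what that means in practice: compare values with `==`, and if you genuinely need a shared object, intern explicitly instead of hoping the interpreter did it for you. The `intern_str` alias is just a hypothetical compatibility shim so the snippet runs on both 2.x and 3.x.

```python
import sys

try:
    intern_str = sys.intern   # Python 3
except AttributeError:
    intern_str = intern       # Python 2 builtin (byte strings only)

literal = "python"
built = "".join(["py", "thon"])   # equal value, but built at runtime

# Equality compares values and is always reliable.
same_value = (literal == built)

# Whether `literal is built` holds is an implementation detail, so it is
# deliberately not tested here. Interning both explicitly yields one
# shared object for equal values.
same_object_after_intern = (intern_str(literal) is intern_str(built))

print(same_value, same_object_after_intern)
```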

If you want any more than this, you have to look at the source for whichever version of whichever implementation you're interested in. For CPython, the details are mostly inside stringobject.c for 2.6 and 2.7, and unicodeobject.c for 3.3.

The latter file of course also exists in 2.x (where it still defines the unicode type; it's just not the same as the str type, as it is in 3.x). You can see from the 2.7 source that there is some support for interning unicode strings, even if you can't call intern on them. From a quick glance, it looks like 2.7 can handle interned unicode strings, but won't ever create them.

Meanwhile, 3.3 makes things even more fun, as a str object can point at UTF-8, UTF-16, or UTF-32 storage, which might be interned, but code that uses the old-style Unicode APIs may still end up with a new copy. So, even if a1 is a2, if you try to get at their characters, they may have different buffers.

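The question "When does python choose to intern a string" (http://stackoverflow.com/questions/10622472/when-does-python-choose-to-intern-a-string) has some more insight into the details. But again, the source is all that matters.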

abarnert