6

I was confused today by a string comparison: it seems python reuses strings (which is a sensible thing to do, since they are immutable). To check this fact I did the following:

>>> a = 'xxx'
>>> b = 'xxx'
>>> a == b
True
>>> a is b
True
>>> id(a)
140141339783816
>>> id(b)
140141339783816
>>> c = 'x' * 3
>>> id(c)
140141339783816
>>> d = ''.join(['x', 'x', 'x'])
>>> id(d)
140141339704576

Which is a bit surprising. some questions:

  • Does python check the whole content of its string table when defining new strings?
  • Is there a limit to the string size?
  • How does this mechanism work (comparing the hashes of the strings?)
  • It does not seem to be used for all kind of generated strings though. What is the rule here?
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
blueFast
  • 41,341
  • 63
  • 198
  • 344
  • `is` tests identity, i.e. memory locations. `==` tests equality. It is not wise to use them interchangeably, as some strings, ints, etc are interned in the name of optimization – inspectorG4dget Sep 05 '14 at 04:43
  • Thanks, but I already know that (I have not asked anything about it). My question is not about `is` versus `==`: it is about how internally python reuses strings. That is, about the internal implementation that python uses to decide that a string does not need to be added to its string table, but that it can be reused. As you can see in my example, that mechanism does not apply for all equal strings, so I would like to understand when and how it is used. – blueFast Sep 05 '14 at 04:51
  • 4
    You might be interested in [what Martijn Pieters has to say](https://www.codementor.io/python-tutorial/stack-overflow-martijn-pieters-python-optimization?utm_source=reddit-content&utm_medium=blog&utm_term=python-tutorial-python-internals&utm_content=blog&utm_campaign=reddit-content) – inspectorG4dget Sep 05 '14 at 04:55
  • @inspectorG4dget: that is exactly what I was looking for, thanks! Please add that answer and I accept. – blueFast Sep 05 '14 at 05:02
  • @jeckyll2hide the answer I think you're after is actually on the [linked dupe](http://stackoverflow.com/a/1504848/1252759)... Python will intern string literals that meet valid syntax to be an identifier or single characters as part of optimisation. `a = '$a'; b = '$b'; a is b` will be `False` for instance – Jon Clements Sep 05 '14 at 05:02
  • @JonClements: so, why does interning apply to `'x' * 3` but not to the `join` operation? The resulting string is the same, so it satisfies the same rules for interning ... The only difference is how that result was obtained. So, again: how does python decide this? – blueFast Sep 05 '14 at 05:07
  • Because `'x'*3` results in a string literal at compile time (ie... it becomes replaced with 'xxx'`) – Jon Clements Sep 05 '14 at 05:09
  • @JonClements: THAT is the answer! If I do: `k = 'x' ; j = k * 3 ` then `j` is not reused. I was confused about the fact that `'x' * 3` is indeed a literal. So I guess the rule is to reuse literals for interning *if certain conditions apply* (as explain by Martijn Pieters). – blueFast Sep 05 '14 at 05:13
  • Exactly, if Python can intern literal strings that are valid tokens at compile time, it will do so... so while you can generate strings at run-time with the same equality, it won't have the same identity. – Jon Clements Sep 05 '14 at 05:17
  • @jeckyll2hide: `'x' * 3` is replaced by `'xxx'` at compile time as a peephole optimisation. I closed this as a duplicate because the answer I wrote there covers all your questions (including mentioning the peephole optimisations). – Martijn Pieters Sep 15 '14 at 16:30

1 Answers1

0

As the question has some upvotes (although it is somewhat of a duplicate), I will answer here my original questions (thanks to the comments above):

  1. Yes, python checks the whole content of the internal table: but only for some strings, mostly those which can also be used as identifiers. The idea is that the speedup trickery used for identifier handling by the python interpreter (compiler?) is also useful for generic string handling. The process is called interning
  2. As far as I know there is no limit to string size, but there are other rules for strings to be reused (mostly: they must look like python identifiers)
  3. Yes, the table is a normal python dict, and the strings have a hash for lookups.
  4. It is only used for string literals and constant expressions. Basically for all things that the python interpreter can infer during the compile phase.

To clarify the last point, the following snippets evaluate in all cases to the string 'xxx', but they are treated differently regarding interning.

This is a constant expression:

'x' * 3

But this is not:

a = 'x'
a * 3   # this is no constant expression, so no interning can be applied.

And this is no expression:

''.join(['x', 'x', 'x']) # this is no expression (a function is called)
blueFast
  • 41,341
  • 63
  • 198
  • 344
  • This answer only adds to the confusion. `'x' * 3` is **not** a string literal, it is an expression. That the result is interned anyway is an implementation detail, an optimization done by the compiler that pre-calculates simple expressions such as `1 + 1` or `'x' * 3` at compile-time. In case of strings, the usual interning of strings that look like identifiers is applied. Expressions entered at the interactive prompt go through exactly the same compile-run path as do `.py` files. – user4815162342 Sep 05 '14 at 09:00
  • @user4815162342: so how do you explain that `a = 'x' ; a * 3` is not interned? That is also an expression, and the resulting string satisfies all requirements for interning ... – blueFast Sep 05 '14 at 09:19
  • It is also an expression, but not one that only involves constants. The current CPython compiler is not smart enough to prove that `a * 3` is in this case the same as `'x' * 3`. (In fact, compile-time folding of constants is a relatively recent addition to the compiler, implemented somewhere in the 2.3-2.5 time frame, if memory serves me.) – user4815162342 Sep 05 '14 at 09:28