How does Python determine if two strings are identical

Question

I've tried to understand when Python strings are identical (aka sharing the same memory location). However, in my tests there seems to be no obvious rule for when two string variables that are equal share the same memory:

import sys
print(sys.version) # 3.4.3

# Example 1
s1 = "Hello"
s2 = "Hello"
print(id(s1) == id(s2)) # True

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

# Example 4
s1 = "HelloHelloHelloHelloHello"
s2 = "HelloHelloHelloHelloHello"
print(id(s1) == id(s2)) # True

# Example 5
s1 = "Hello" * 5
s2 = "Hello" * 5
print(id(s1) == id(s2)) # False

Strings are immutable, and as far as I know Python tries to re-use existing immutable objects, by having other variables point to them instead of creating new objects in memory with the same value.

With this in mind, it seems obvious that Example 1 returns True.
It's still obvious (to me) that Example 2 returns True.

It's not obvious to me that Example 3 returns False - am I not doing the same thing as in Example 2?!?

I stumbled upon this SO question:
Why does comparing strings in Python using either '==' or 'is' sometimes produce a different result?

and read through http://guilload.com/python-string-interning/ (though I probably didn't understand it all) and thought: okay, maybe "interned" strings depend on the length, so I used HelloHelloHelloHelloHello in Example 4. The result was True.

What puzzled me then was doing the same as in Example 2, just with a bigger number (which effectively produces the same string as Example 4) - however, this time the result was False?!?

I really have no idea how Python decides whether to use the same memory object or to create a new one.

Are there any official sources that can explain this behavior?

I don't understand why example 2 and 5 should give different results. — Mohit Motwani, Mar 22 '18 at 11:13
Why do you care? If you want to enforce this then just intern the strings with `sys.intern()`, otherwise the rules are implementation specific as you know, and therefore not documented by Python — Chris_Rands, Mar 22 '18 at 11:13
The second link that you give goes into great detail. It is unlikely that a Stack Overflow answer will be better than that link. The last sentence in your question is actually off-topic (although that doubtless wasn't your intention). — John Coleman, Mar 22 '18 at 11:15
Possible duplicate of [Python string interning](https://stackoverflow.com/questions/15541404/python-string-interning) — Chris_Rands, Mar 22 '18 at 11:21
The guilload blog post you linked has the answer to your question: "you have to admit that it is hard to tell on which basis a string is natively interned or not. So let’s read some more CPython code and figure it out!" — Stop harming Monica, Mar 22 '18 at 11:29
Possible duplicate of [When does python compile the constant string letters, to combine the constant strings to one constant string](https://stackoverflow.com/questions/14493236/when-does-python-compile-the-constant-string-letters-to-combine-the-constant-st) — jdehesa, Mar 22 '18 at 11:48

jdehesa · Accepted Answer · 2019-03-26T10:36:43.837

From the link you posted:

Avoiding large .pyc files

So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like this ['foo!'] * 10**9? The resulting .pyc file would be huge! In order to avoid this phenomena, sequences generated through peephole optimization are discarded if their length is superior to 20.

If you have the string "HelloHelloHelloHelloHello", Python will necessarily have to store it as it is (asking the interpreter to detect repeating patterns in a string to save space might be too much). However, when it comes to string values that can be computed at compile time, such as "Hello" * 5, Python evaluates those as part of this so-called "peephole optimization", which decides whether precomputing the string is worth it. Since len("Hello" * 5) > 20, the interpreter leaves the expression as it is to avoid storing too many long strings. The multiplication is then performed at run time, once per assignment, which produces two distinct objects; hence the False in Example 5.
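One way to see this for yourself is to look at the constants the compiler stores in a code object. The snippet below is only a sketch: string_constants is a helper name made up for this illustration, and the exact contents of co_consts are a CPython implementation detail that differs between versions.

```python
def string_constants(source):
    """Return the string constants the compiler stored for `source`."""
    code = compile(source, "<example>", "exec")
    return sorted(c for c in code.co_consts if isinstance(c, str))


# Literal operands: the compiler may precompute the product.
print(string_constants('s = "Hello" * 3'))

# Variable operand: the product is only known at run time,
# so just the literal "Hello" can be stored as a constant.
print(string_constants('i = 3\ns = "Hello" * i'))
```

On CPython the first call should list the precomputed "HelloHelloHello" (possibly alongside "Hello", depending on the version), while the second lists only "Hello".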

EDIT:

As indicated in this question, you can check this in the source code: in peephole.c, near the end of the function fold_binops_on_constants, you will see:

// ...
} else if (size > 20) {
    Py_DECREF(newconst);
    return -1;
}

EDIT 2:

Actually, that optimization code has recently been moved to the AST optimizer for Python 3.7, so now you would have to look into ast_opt.c, function fold_binop, which now calls the function safe_multiply, which checks that the string is no longer than MAX_STR_SIZE, newly defined as 4096. So it seems that the limit has been significantly bumped up for the next releases.
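Since the limit depends on the interpreter version, it may be easier to probe it than to remember it. The following is a rough sketch; is_folded is a hypothetical helper, and it assumes that a precomputed result shows up in the code object's co_consts, which is a CPython implementation detail.

```python
def is_folded(n):
    """Report whether the compiler precomputed 'a' * n for this n."""
    code = compile("s = 'a' * {}".format(n), "<probe>", "exec")
    # If the product was folded, the full string is stored as a constant.
    return ("a" * n) in code.co_consts


# Lengths around the two limits mentioned above, plus a very large one.
for n in (5, 20, 21, 4096, 4097, 10**6):
    print(n, is_folded(n))
```

Where the output flips from True to False tells you which limit your interpreter applies.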

SaiPawan · Answer 2 · 2018-03-23T11:03:41.443

In Example 2 :

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

Here the values of s1 and s2 are evaluated at compile time, so this returns True.

In Example 3 :

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

Here the values of s1 and s2 are computed at runtime and the results are not interned automatically, so each multiplication creates its own "HelloHelloHello" object and this returns False. Presumably this avoids the overhead of interning every string created at runtime.

If you intern the strings manually, it returns True (in Python 3, intern lives in the sys module):

import sys

i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(sys.intern(s1)) == id(sys.intern(s2))) # True
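Conceptually, interning just means keeping a table that maps each string value to one canonical object. The toy model below illustrates that idea only; toy_intern and _canonical are names invented here, and this is not how CPython implements it internally.

```python
_canonical = {}  # hypothetical table: string value -> first object seen


def toy_intern(s):
    """Return one shared object for every string equal to `s`."""
    return _canonical.setdefault(s, s)


i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(s1 == s2)                          # True: same characters
print(toy_intern(s1) is toy_intern(s2))  # True: both map to the first object
```

Either way, == is the reliable way to compare string contents; identity only tells you whether two names happen to refer to the same object.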
