-1

I'm not asking about the difference between the == and is operators! I am asking about string interning.

In Python 3.9.1,

>>> str(1) is '1'
<stdin>:1: SyntaxWarning: "is" with a literal. Did you mean "=="?
False
>>> '1' is '1'
<stdin>:1: SyntaxWarning: "is" with a literal. Did you mean "=="?
True

I found out that characters which match [a-zA-Z0-9_] are interned in Python. I understand why '1' is '1' is True: Python stores the string '1' somewhere in memory and refers to it whenever the literal '1' is used. And str(1) returns '1', and I think it should refer to the same address as other literal '1's. Shouldn't str(1) is '1' also be True?

U13-Forward
dkssud
  • 2
    All string literals are interned, i.e. `"foobar!?" is "foobar!?"` is also `True`. `str(1)` is not a _literal_ therefore it is not interned. – Selcuk Sep 02 '21 at 06:04
  • 2
    Nope, `is` is used to compare references rather than the contents of the objects. If you have two different string objects with the same contents, `is` will give False, but `==` will give True. `is` has nothing special to do with types. – Muhteva Sep 02 '21 at 06:05
  • 2
    @MHP That's incorrect/missing information. – Selcuk Sep 02 '21 at 06:07
  • [python-string-interning](https://stackoverflow.com/questions/15541404/python-string-interning) – Patrick Artner Sep 02 '21 at 06:07
  • 1
    @Muhteva The OP is asking about interning, they seem to know the difference between `is` and `==`. Python makes that difference a bit vague in some cases because of compiler optimisations. For example after `a = "12"` and `b = "1" + "2"`, `a is b` will (surprisingly) return `True`. – Selcuk Sep 02 '21 at 06:10
  • 1
    I know, I just wanted to point out that argument of @MHP is false. – Muhteva Sep 02 '21 at 06:11
  • @Selcuk - good example. It's the compiler that figures out "1" + "2" and makes the string literal that is interned. – tdelaney Sep 02 '21 at 06:13
  • See https://stackoverflow.com/questions/17679861/does-python-intern-strings – tdelaney Sep 02 '21 at 06:15
  • Notes on interning in the `intern` function docs: https://docs.python.org/3/library/sys.html#sys.intern – tdelaney Sep 02 '21 at 06:17
  • "Shouldn't str(1) is '1' also be True?" No, there is no reason for you to expect that. Any interning is an implementation detail anyway – juanpa.arrivillaga Sep 02 '21 at 06:41

3 Answers

0

Since str() is called at runtime (not at compile time), the Python compiler does not know what str(1) is going to be equal to. Concatenation of string literals and plain literal assignments, on the other hand, are resolved at compile time, so the optimisation (string interning) can be applied to the resulting constant. See the examples below:

>>> a = '12'
>>> b = '1' + '2'
>>> c = str(12)
>>> d = ''.join(['1', '2'])

>>> a is b
True
>>> a is c
False
>>> b is c
False
>>> a is d
False
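
The compile-time part can be checked without running any of the statements. The following is a minimal sketch, assuming CPython (constant folding is an implementation detail, not a language guarantee); the variable names and the "<example>" filename are only illustrative. It compiles two statements and looks at the constants stored in the resulting code objects:

```python
# Compile the statements without executing them, then inspect their constants.
# Assumption: CPython, which folds '1' + '2' into a single constant at compile time.
folded_code = compile("b = '1' + '2'", "<example>", "exec")
call_code = compile("c = str(12)", "<example>", "exec")

# The folded string '12' is stored as a constant of the first code object.
folded = "12" in folded_code.co_consts

# The second code object only stores the integer 12; the string is built at runtime.
runtime_has_constant = "12" in call_code.co_consts

print(folded_code.co_consts)
print(call_code.co_consts)
print(folded, runtime_has_constant)
```

In the first case the compiler already holds the finished string and can intern it; in the second case it only sees a call to a name that happens to be str.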
Rustam Garayev
0

I found out that characters which match [a-zA-Z0-9_] are interned in Python

You "found out" wrong I fear.

Python automatically interns string literals as well as symbol names (e.g. module names, class names, method names, ...).

And str(1) returns '1', and I think it should refer to the same address as other literal '1's. Shouldn't str(1) is '1' also be True?

No. Interning is an explicit deduplication operation, and CPython only exposes an API to intern a string; there is currently no way to intern a string only if an interned version already exists, short of messing with the interpreter internals by hand.

This means if the output of str() could be interned (by str itself) it always would be, which is probably undesirable: while CPython does not "leak" interned strings (they don't live forever), increasing the size of the interning map adds to its maintenance and management costs.
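
To illustrate the "explicit" part, here is a small sketch (the helper name build_runtime_strings is made up for this example). Two strings are built at runtime, so neither comes from a literal; they only become the same object once the caller interns them and keeps the returned references:

```python
import sys


def build_runtime_strings():
    """Return two equal strings that are both created at runtime."""
    first = str(12)
    second = "".join(["1", "2"])
    return first, second


first, second = build_runtime_strings()

# sys.intern returns the canonical object for this content; keep the
# return values, because those are the references that are deduplicated.
interned_first = sys.intern(first)
interned_second = sys.intern(second)

print(first == second)                     # True: same contents
print(interned_first is interned_second)   # True: same interned object
```

Whether first is second holds before interning is up to the implementation, which is why the sketch does not rely on it.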

Masklinn
0

is checks identity (references), not content. Also, str(1) is not a literal, therefore it is not interned.

But '1' is interned because it is written directly as a string literal, whereas str(1) builds a new string object at runtime. As you can see:

>>> a = '1'
>>> b = str(1)
>>> a
'1'
>>> b
'1'
>>> a is b
False
>>> id(a)
1603954028784
>>> id(b)
1604083776304

So the way to get the same interned object for both is sys.intern:

>>> import sys
>>> a = '1'
>>> b = str(1)
>>> a is b
False
>>> a is sys.intern(b)
True
>>> 

As mentioned in the docs:

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.

Note that in Python 2 intern() was a built-in function; in Python 3 it was moved into the sys module and became sys.intern.

U13-Forward
  • `i = 1; '012'[i] is '1'` also goes through a process but gives me `True`. – no comment Sep 02 '21 at 06:30
  • @don'ttalkjustcode Well yes it's because indexing would feed the code into the `__getitem__` magic method where the str class controls, same with `+` – U13-Forward Sep 02 '21 at 06:31
  • Same with what `+`? – no comment Sep 02 '21 at 06:35
  • @don'ttalkjustcode It goes into the `__add__` magic method – U13-Forward Sep 02 '21 at 06:36
  • @don'ttalkjustcode It still would make it remain in the same class instance so the references would be equivalent, whereas calling like `str(1)` would start a completely new class instance. – U13-Forward Sep 02 '21 at 06:37
  • @don'ttalkjustcode if you're doing `'1' + '1'` that actually gets constant-folded to `'11'` during bytecode compilation. – Masklinn Sep 02 '21 at 07:13
  • @don'ttalkjustcode as for indexing, I assume that's a specifically implemented behaviour (as single chars from strings would usually be used a lot): it happens for indexing and single-char slices, but not for slices of length 2+. – Masklinn Sep 02 '21 at 07:23
  • @Masklinn Yes, I know about that `+` during compilation. If that's what U12 meant, it's not comparable to my runtime one, that's why I asked. And yes, sounds like a good optimization to do. Good idea to test slices of 1 or 2. – no comment Sep 02 '21 at 11:22
  • @U12-Forward What would "remain in the same class instance"? – no comment Sep 02 '21 at 11:22
  • @don'ttalkjustcode ah I guess I misunderstood your comment about `+`, sorry about that. – Masklinn Sep 02 '21 at 11:24
  • @Masklinn No worries. Good idea with the slices, and [apparently](https://tio.run/##fcqxDsIgFIXhvU9xtoITtNEYYp@EMDSK9i5A7mXQp0ds4uLgP57zlVfdcprPhVu7ZwaBEnhNj6jM01rT025AT7DgurEijQPmfVpFIlco8SaABOJt0FgWKMIF0/Gkf5mzX@imP1R8vz8w5brjMBSmVNV4yymOurU3) this optimization is done for characters 0-255. – no comment Sep 02 '21 at 11:42
  • @don'ttalkjustcode makes sense, doing it for the entire unicode space seemed a bit much but I didn't test if the range was more restricted e.g. to just ascii or extended ascii. With that hint, it looks like those actually have their own special case just for them: when creating a string from a single byte (`Py_UCS1`), which is what `str.__getitem__` does internally but apparently not `str(1)`, the byte is looked up in the `latin1` array of the interpreter state, not even in the separate "interned string" cache. – Masklinn Sep 02 '21 at 12:00
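
The single-character behaviour discussed in the comments above can be probed with a short sketch. This assumes CPython, and the helper name reindex_gives_same_object is hypothetical. It builds a string at runtime, indexes the same position twice, and reports whether both results are one object. The comments above report that this is the case for code points 0-255; the result for other characters is an implementation detail, so no output is claimed here:

```python
def reindex_gives_same_object(ch):
    """Return True if indexing a runtime-built string twice yields one object."""
    text = "".join(["x", ch])  # built at runtime, not a literal
    first = text[1]
    second = text[1]
    return first is second


# Probe one ASCII character, the last Latin-1 code point, and the first one past it.
for ch in ("1", "\xff", "\u0100"):
    print(hex(ord(ch)), reindex_gives_same_object(ch))
```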