Strings from `raw_input()` in memory

Question

I've known for a while that Python likes to reuse strings in memory instead of having duplicates:

>>> a = "test"
>>> id(a)
36910184L
>>> b = "test"
>>> id(b)
36910184L

However, I recently discovered that the string returned from raw_input() does not follow that typical optimization pattern:

>>> a = "test"
>>> id(a)
36910184L
>>> c = raw_input()
test
>>> id(c)
45582816L

I'm curious why this is the case. Is there a technical reason?

mgilson · Accepted Answer · 2013-02-16T02:04:03.547

3

It appears to me that Python interns string literals, but strings created by some other process are not interned:

>>> s = 'ab'
>>> id(s)
952080
>>> g = 'a' if True else 'c'
>>> g += 'b'
>>> g
'ab'
>>> id(g)
951336

Of course, raw_input creates new strings without using string literals, so it is reasonable to expect that its result won't have the same id. There are (at least) two reasons why CPython interns strings: memory (you can save a lot if you don't store many copies of the same thing) and faster resolution of hash collisions. If two strings hash to the same value (in a dictionary lookup, for instance), Python needs to check whether the strings are actually equal. If they are not interned it has to do a string compare, but if both are interned a pointer compare is enough, which is a bit more efficient.
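A minimal sketch of that distinction, written for Python 3 (where `intern` lives in `sys` and `raw_input` became `input`). The `"".join(...)` call is only a stand-in for any string built at run time, such as one typed by the user. Whether the un-interned string shares an object with the literal is a CPython implementation detail, so the snippet prints that result instead of relying on it.

```python
import sys

literal = "test"                   # string literal, stored with the compiled code
built = "".join(["te", "st"])      # built at run time, like the result of input()

print(built == literal)            # True: the values are equal
print(built is literal)            # implementation detail; do not rely on it

# sys.intern() returns the single canonical object for a string value, so two
# interned strings with equal values are always the same object.
interned_built = sys.intern(built)
interned_literal = sys.intern(literal)
print(interned_built is interned_literal)   # True
```

Once both strings are interned, an identity check is enough to establish equality, which is the pointer compare described above.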

edited Feb 16 '13 at 02:04

answered Feb 16 '13 at 01:35

mgilson


@PauloScardine -- Likely. I'm not an expert in the C source -- I only go spelunking once in a while :) – mgilson Feb 16 '13 at 02:23
Seems like CPython always interns strings that are 0 or 1 byte long, no matter how they are created: `s = 'a'; x = 'ab'; g = x[0]; print id(s), id(g)` – Paulo Scardine Feb 16 '13 at 03:39
@PauloScardine -- That's not surprising. There are only 256 of them, so it's not much overhead -- similar to how it caches small ints (-5 to 256, I believe). – mgilson Feb 16 '13 at 04:19

score 2 · Answer 2 · answered Feb 16 '13 at 01:35

The compiler can only intern strings that are present in the actual source code (i.e., string literals). In addition, raw_input strips the trailing newline from what it reads, so the string it returns is always built at run time.

answered Feb 16 '13 at 01:35

Jon Clements


Also, Python doesn't intern all strings. – elssar Feb 16 '13 at 01:36
@elssar indeed - and I seem to recall a lot of debate about manually using `intern` (but that was *years* ago) – Jon Clements Feb 16 '13 at 01:38
@JonClements -- Apparently you can manually intern strings to gain a small speed boost in the event of hash collisions in a dictionary lookup -- Then python can do a pointer compare rather than a string compare to see if the strings are the same. http://docs.python.org/2/library/functions.html#intern – mgilson Feb 16 '13 at 01:39
@mgilson Yup - the debate was about the behaviour change in 2.3 - knew something rang a bell... can't say I've ever had to use it though ;) – Jon Clements Feb 16 '13 at 01:41
@JonClements -- Nor should you. That is in the "non-essential builtin functions" section. – mgilson Feb 16 '13 at 01:42

score 2 · Answer 3 · edited May 23 '17 at 12:04

[update] In order to answer the question, it is necessary to know why, how and when Python reuses strings.

Let's start with the how: Python uses "interned" strings. From Wikipedia:

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

Why? Seems like saving memory is not the main goal here, only a nice side-effect.

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.

Now the "when": CPython interns a string in the following situations:

when you call the non-essential built-in function intern() (which moved to sys.intern in Python 3)
for small strings (0 or 1 byte long); this very informative article by Laurent Luce explains the implementation
for the names used in Python programs, which are interned automatically
for the keys of the dictionaries that hold module, class or instance attributes

In other situations, implementations vary widely in when they intern strings automatically.
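As a rough illustration of the attribute-dictionary case, here is a small Python 3 sketch. The `Config` class and its `timeout` attribute are hypothetical names invented for the example, and the identity result on the last line is CPython behaviour rather than a language guarantee, so it is printed without a claimed value.

```python
import sys


class Config:
    """Hypothetical class, used only to obtain an instance __dict__."""


cfg = Config()
cfg.timeout = 30                   # the attribute name comes from source code

# Build an equal string at run time so that it starts out as a separate object.
runtime_name = "".join(["time", "out"])

# The key object actually stored in the instance dictionary.
stored_key = next(iter(cfg.__dict__))

print(stored_key == runtime_name)  # True: same value

# If the stored key is interned, sys.intern() hands back that very object.
print(sys.intern(runtime_name) is stored_key)   # CPython detail, may vary elsewhere
```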

I could not possibly put it better than Alex Martelli did in this answer (no wonder the guy has a 245k reputation):

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, are just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: one more reference to a suitable existing object when locating such an object is cheap and easy, just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).

I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).

[initial answer]

This is more a comment than an answer, but the comment system is not well suited for posting code:

def main():
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))

if __name__ == '__main__':
    main()

Running this gives me:

"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688

From my experience, at least on 2.7, there is some optimization going on even for raw_input().

If the implementation uses hash tables, I guess there is more than one. Going to dive in the source right now.

[first update]

Looks like my experiment was flawed:

def main():
    storage = []
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))
        storage.append(s)

if __name__ == '__main__':
    main()

Result:

"test".id = 43274944
"test".id = 43277104
"test".id = 43487408
"test".id = 43487504
"test".id = 43487552
"test".id = 43487600
"test".id = 43487648
"test".id = 43487744
"test".id = 43487864
"test".id = 43487936
"test".id = 43487984
"test".id = 43488032

In his answer to another question, user tzot warns about object lifetime:

A side note: it is very important to know the lifetime of objects in Python. Note the following session:

Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False

Your thinking that by printing the IDs of two separate expressions and noting “they are equal ergo the two expressions must be equal/equivalent/the same” is faulty. A single line of output does not necessarily imply all of its contents were created and/or co-existed at the same single moment in time.

If you want to know if two objects are the same object, ask Python directly (using the is operator).
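A short Python 3 sketch of the same point: binding both results to names keeps them alive at the same time, and two objects that coexist can never share an id. The variable names `left` and `right` are just illustrative.

```python
a = "a"
b = "b"

# Keep both results alive by binding them to names.
left = a + b     # 'ab'
right = b + a    # 'ba'

print(left is right)            # False: they are different objects
print(id(left) == id(right))    # False while both objects are alive

# By contrast, id(a + b) == id(b + a) may still be True in CPython, because
# the first temporary can be freed and its address reused for the second.
```

This is also why appending each input to the `storage` list in the second experiment changed the outcome: the stored references stop the earlier strings from being freed, so their addresses cannot be reused.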

you will get the same result even if you use different inputs. this is the garbage collection happening here. — dnozay, Feb 16 '13 at 02:23
actually I lie, I get the same result even with `gc.disable()`. — dnozay, Feb 16 '13 at 02:26
