When does Python allocate new memory for identical strings?

Question

Two Python strings with the same characters, a == b, may share memory, id(a) == id(b), or may be in memory twice, id(a) != id(b). Try

ab = "ab"
print id( ab ), id( "a"+"b" )

Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory -- not bad.

Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).
I see 50 different id() s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again: the "same" list now has N different id() s, way more memory, see below.

How come -- can anyone explain Python string memory allocation ?

""" when does Python allocate new memory for identical strings ?
    ab = "ab"
    print id( ab ), id( "a"+"b" )  # same !
    list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
    but list > file > mem again: N ids, mem ~ N * (4 + S)
"""

from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys

states = dict(
AL = "Alabama",
AK = "Alaska",
AZ = "Arizona",
AR = "Arkansas",
CA = "California",
CO = "Colorado",
CT = "Connecticut",
DE = "Delaware",
FL = "Florida",
GA = "Georgia",
)

def nid(alist):
    """ nr distinct ids """
    return "%d ids  %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 )))  # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents

N = 10000
exec( "\n".join( sys.argv[1:] ))  # var=val ...
random.seed(1)

    # big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem:  %s" % (N, nid(names) )  # 10 ids, even with copy()

    # list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split()  # same as > file > mem again
assert joinsplit == names
print "%d strings from a file:  %s" % (N, nid(joinsplit) )

# 10000 strings in mem:  10 ids  42149 pickle len  
# 10000 strings from a file:  10000 ids  188080 pickle len
# Python 2.6.4 mac ppc
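
The pickle length above is only a rough proxy for memory. A minimal sketch of a more direct estimate, assuming CPython and Python 3 (the code in this question is Python 2): add `sys.getsizeof` of the list to `sys.getsizeof` of each distinct string object, counting an object once no matter how many list slots point at it. The helper name `list_of_strings_size` is made up for this illustration, and the byte counts it reports are implementation-specific.

```python
import sys

def list_of_strings_size(strings):
    """Approximate bytes used by a list of strings, counting each distinct
    string object once (hypothetical helper; sizes are CPython-specific)."""
    unique = {id(s): s for s in strings}   # one entry per distinct object
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in unique.values())

N = 1000
name = "Alaska"
shared = [name for _ in range(N)]                     # N references to 1 object
fresh = ["".join(["Ala", "ska"]) for _ in range(N)]   # each string built at run time

assert shared == fresh
print("shared:", len(set(map(id, shared))), "ids,", list_of_strings_size(shared), "bytes")
print("fresh: ", len(set(map(id, fresh))), "ids,", list_of_strings_size(fresh), "bytes")
```

The two lists compare equal, but the second one pays for the string payload once per element, which is the N * (4 + S) case from the docstring above.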

Added 25jan:
There are two kinds of strings in Python memory (or any program's):

Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in Ucache
Ostrings, the others, which may be stored any number of times.

intern(astring) puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache -- how did "a"+"b" get in, after "ab" ? ("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.

A historical footnote: SPITBOL uniquified all strings ca. 1970.

score 49 · Accepted Answer · answered Jan 23 '10 at 17:34

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, are just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: use one more reference to a suitable existing object when locating such an object is cheap and easy, and just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).

I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
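
A minimal sketch of the `intern` approach applied to the question's scenario, assuming Python 3 on CPython, where the function lives at `sys.intern` (in Python 2 it is the builtin `intern`). The short state list and the helper `count_ids` are illustrative, not taken from the question's script.

```python
import random
import sys

states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]
random.seed(1)
names = [random.choice(states) for _ in range(10000)]

# Round trip through one big string, standing in for "write to a file, read back".
reloaded = "\n".join(names).split()
assert reloaded == names

# sys.intern returns one canonical object per distinct string value.
interned = [sys.intern(s) for s in reloaded]

def count_ids(strings):
    """Number of distinct objects referenced by the list."""
    return len(set(map(id, strings)))

print(count_ids(reloaded), "ids before interning")
print(count_ids(interned), "ids after interning")
```

Once `reloaded` is dropped, the duplicate string objects can be reclaimed and only the interned copies remain.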

Is there anything of value within my answer that you don't think is covered in yours? If not, I'll delete my answer. If there is, do you want to edit it into yours and *then* I'll delete my answer? — Jon Skeet, Jan 23 '10 at 17:54
+1 for mentioning `intern`. I had completely forgotten that this function existed. Using `joinsplit = [intern(n) for n in "\n".join(names).split()]` did the job and lowered memory usage from 4,374,528 to 3,190,783 on my MacBook. — D.Shawley, Jan 23 '10 at 18:20
@John, I think having the two viewpoints (mine from an "insider's perspective", yours from an experienced programmer without a special "insider's perspective" on Python) is valuable as it stands -- not sure there's an optimal way to get the same "triangulation" within a single answer! — Alex Martelli, Jan 23 '10 at 18:36
Lua always has only one instance of any particular string. It's a very neat system: a bit of overhead on string creation (very small in practice) makes all comparisons for string equality an O(1) pointer comparison. — Glenn Maynard, Jan 23 '10 at 20:46
@AlexMartelli `intern` is no longer in python as of 3.4. You mention you can "roll your own"; but I'm not sure I understand how to do that... — max, Mar 22 '15 at 08:11
@max, you make a factory function that uses a hash table (for speed) to hold immutables (strings, tuples, whatever) and returns a reference to the existing one if any, the newly inserted one if previously absent. — Alex Martelli, Mar 22 '15 at 17:12
@max For Python 3, `intern` is in the `sys` module: https://docs.python.org/3/library/sys.html. In general, to roll your own, you can establish a data structure that holds objects of the types you like (e.g. a dictionary) and do the same sort of thing that intern does: establish a storage/lookup method which returns keys from the dictionary as references. — nealmcb, Jan 20 '19 at 17:34

Jon Skeet · Answer 2 · 2010-01-23T17:17:25.843

21

I strongly suspect that Python is behaving like many other languages here - recognising string constants within your source code and using a common table for those, but not applying the same rules when creating strings dynamically. This makes sense as there will only be a finite set of strings within your source code (although Python lets you evaluate code dynamically, of course) whereas it's much more likely that you'll be creating huge numbers of strings in the course of your program.

This process is generally called interning - and indeed by the looks of this page it's called interning in Python, too.

edited Jan 23 '10 at 17:17

answered Jan 23 '10 at 17:12

Jon Skeet


Any idea then why id("ab") == id("a"+"b") ? Would you agree that we just don't know how Python runs Ucaches ? – denis Jan 25 '10 at 17:43
5

For completeness: the expression `"a"+"b"` is statically turned into the expression `"ab"`, which is then found to be the same string as the other one. It all happens at compile-time. – Armin Rigo Nov 14 '13 at 20:34
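
One way to observe this compile-time folding, sketched for CPython on Python 3: inspect the constants stored on a function's code object. The function names are invented for the illustration, and the exact contents of `co_consts` vary between CPython versions, so only the presence or absence of `"ab"` is of interest.

```python
def folded():
    return "a" + "b"      # both operands are literals

def not_folded(a, b):
    return a + b          # operands are known only at run time

# The compiler stores the already-concatenated "ab" as a constant of folded().
print(folded.__code__.co_consts)
print(not_folded.__code__.co_consts)
```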

tzot · Answer 3 · 2012-09-17T21:46:13.457

14

A side note: it is very important to know the lifetime of objects in Python. Note the following session:

Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False

Your reasoning, that printing the IDs of two separate expressions and noting “they are equal” shows that the two expressions must be equal/equivalent/the same, is faulty. A single line of output does not necessarily imply that all of its contents were created and/or co-existed at the same moment in time.

If you want to know if two objects are the same object, ask Python directly (using the is operator).

edited Sep 17 '12 at 21:46

answered Jan 24 '10 at 02:09

tzot


11

A bit of explanation as to what's going on here: the `print id(a+b), id(b+a)` line first concatenates "a" and "b" into a newly-allocated string "ab", then passes that to `id`, then deallocates it since it's no longer needed. Then "ba" is allocated in the same way, and ends up being allocated at the same location in memory (CPython has a habit of doing this). "ba" is then passed to `id`, which returns the same result. With the next line, however, both "ab" and "ba" are kept around to be passed to the `is` operator, so they are necessarily allocated at different positions. – javawizard Nov 27 '12 at 02:57
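
The lifetime point can be sketched as follows (Python 3 syntax; the variable names `ab` and `ba` are illustrative). Whether the first line prints two equal ids depends on the allocator, so no output is claimed for it; the only guarantee is that two objects alive at the same time have different ids.

```python
a, b = "a", "b"

# Temporaries: each result may be freed before the next one is created,
# so the two ids are allowed to coincide.
print(id(a + b), id(b + a))

# Named results: both objects are alive together, so their ids must differ.
ab = a + b
ba = b + a
print(id(ab) != id(ba))
print(ab is ba)
```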

score 3 · Answer 4 · answered Jan 23 '10 at 17:54

x = 42
y = 42
x == y #True
x is y #True

In this interaction, x and y should be == (same value), but not is (same object), because we ran two different literal expressions. Because small integers and strings are cached and reused, though, is tells us they reference the same single object.

In fact, if you really want to look under the hood, you can always ask Python how many references there are to an object: the getrefcount function in the standard sys module returns the object’s reference count. This behavior reflects one of the many ways Python optimizes its model for execution speed.

Learning Python
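
A small sketch of `sys.getrefcount`, assuming CPython and Python 3. It uses a fresh `object()` rather than 42, because counts for cached objects such as small integers are shared with the whole interpreter (and in recent CPython versions such objects may be immortal), so their absolute numbers say little. The reported count also includes the temporary reference held by the call itself, which is why only the difference is examined.

```python
import sys

obj = object()                 # a fresh object owned only by this name
before = sys.getrefcount(obj)  # includes the call's own temporary reference
holders = [obj, obj, obj]      # the list stores three more references
after = sys.getrefcount(obj)

print(after - before)
```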

score 2 · Answer 5 · answered Dec 12 '18 at 06:27

I found a good article explaining the interning behavior of CPython: http://guilload.com/python-string-interning/

In short:

A string object in CPython has a flag indicating whether it is interned.
Interned strings are stored in an ordinary dictionary whose keys and values are both pointers to the string. Only the string class is accepted.
Interning helps Python reduce memory consumption, because objects can refer to the same memory address, and speeds up comparison, because only the strings' pointers have to be compared.
Python does the interning during compilation, which means only literal strings (or strings that can be computed at compile time, like 'hello' + 'world') are interned automatically.
For your question: only strings of length 0 or 1, or strings made up solely of ASCII letters, digits, and underscores (a-z, A-Z, 0-9, _), are interned.
Interning works in Python because strings are immutable; otherwise it would not make sense.

This is a really good article. I strongly suggest visiting his site and checking out the other ones; they are worth our time.
