[update] in order to answer the question, it is necessary to know why, how and when Python reuse strings.
Lets start with the how: Python uses "interned" strings - from wikipedia:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
Why? Seems like saving memory is not the main goal here, only a nice side-effect.
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.
Now the "when": cpython "interns" a string in the following situations:
- when you use the
intern()
non-essential built-in function (which moved to sys.intern
in Python 3).
- small strings (0 or 1 byte) - this very informative article by Laurent Luce explains the implementation
- the names used in Python programs are automatically interned
- the dictionaries used to hold module, class or instance attributes have interned keys
For other situations, every implementation seems to have great variation on when strings are automatically interned.
I could not possibly put it better than Alex Martinelli did in this answer (no wonder the guy has a 245k reputation):
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, are just fine from the language's point of view. In practice, of course, real-world implementation strike reasonable compromise: one more reference to a suitable existing object when locating such an object is cheap and easy, just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
[initial answer]
This is more a comment than an answer, but the comment system is not well suited for posting code:
def main():
while True:
s = raw_input('feed me:')
print '"{}".id = {}'.format(s, id(s))
if __name__ == '__main__':
main()
Running this gives me:
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
From my experience, at least on 2.7, there is some optimization going on even for raw_input()
.
If the implementation uses hash tables, I guess there is more than one. Going to dive in the source right now.
[first update]
Looks like my experiment was flawed:
def main():
storage = []
while True:
s = raw_input('feed me:')
print '"{}".id = {}'.format(s, id(s))
storage.append(s)
if __name__ == '__main__':
main()
Result:
"test".id = 43274944
"test".id = 43277104
"test".id = 43487408
"test".id = 43487504
"test".id = 43487552
"test".id = 43487600
"test".id = 43487648
"test".id = 43487744
"test".id = 43487864
"test".id = 43487936
"test".id = 43487984
"test".id = 43488032
In his answer to another question, user tzot warns about object lifetime:
A side note: it is very important to know the lifetime of objects in Python. Note the following session:
Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False
Your thinking that by printing the IDs of two separate expressions and noting “they are equal ergo the two expressions must be equal/equivalent/the same” is faulty. A single line of output does not necessarily imply all of its contents were created and/or co-existed at the same single moment in time.
If you want to know if two objects are the same object, ask Python directly (using the is
operator).