Why tempory variable calculating in python for-loop takes so much memory usage?

Question

The following two codes are equivalent, but the first one takes about 700M memory, the latter one takes only about 100M memory(via windows task manager). What happens here?

def a():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        lst.append(t)
    return lst

_ = a()

def a():
    lst = []
    for i in range(10**7):
        t = "a" * 2
        lst.append(t)
    return lst

_ = a()

I got 110M and 650M usage from task manager. 3.7.3 64bit. EDIT: Using ```psutil``` i also get memory usage at 95M and 625M. — Joshua Nixon, Sep 06 '19 at 11:03
Looking at the byte code, the interpreter has already multipled "a" * 2 when generating code vs doing it at runtime...but I don't see why that would make such a large difference in memory usage — , Sep 06 '19 at 11:24
I have no idea why the difference is so large. I'd be very interested in the answer. — Joshua Nixon, Sep 06 '19 at 11:30
It is probably because of the garbage collection has not freed `t="a"`. Because strings are immutable `t = t * 2` allocates always new memory. But this does not describe this huge difference. — nauer, Sep 06 '19 at 11:54

ead · Accepted Answer · 2019-09-07T19:33:24.623

@vurmux presented the right reason for the different memory usage: string interning, but some important details seem to be missing.

CPython-implementation interns some strings during the compilation, e.g "a"*2 - for more info about how/why "a"*2 gets interned see this SO-post.

Clarification: As @MartijnPieters has correctly pointed out in his comment: the important thing is whether the compiler does the constant-folding (e.g. evaluates the multiplication of two constants "a"*2) or not. If constant-folding is done, the resulting constant will be used and all elements in the list will be references to the same object, otherwise not. Even if all string constants get interned (and thus constant folding performed => string interned) - still it was sloppy to speak about interning: constant folding is the key here, as it explains the behavior also for types which have no interning at all, for example floats (if we would use t=42*2.0).

Whether constant folding has happened, can be easily verified with dis-module (I call your second version a2()):

>>> import dis
>>> dis.dis(a2)
  ...
  4          18 LOAD_CONST               2 ('aa')
             20 STORE_FAST               2 (t)
  ...

As we can see, during the run time the multiplication isn't performed, but directly the result (which was computed during the compiler time) of the multiplication is loaded - the resulting list consists of references to the same object (the constant loaded with 18 LOAD_CONST 2):

>>> len({id(s) for s in a2()})
1

There, only 8 bytes per reference are needed, that means about 80Mb (+overalocation of the list + memory needed for the interpreter) memory needed.

In Python3.7 constant folding isn't performed if the resulting string has more than 4096 characters, so replacing "a"*2 with "a"*4097 leads to the following byte-code:

 >>> dis.dis(a1)
 ...
  4          18 LOAD_CONST               2 ('a')
             20 LOAD_CONST               3 (4097)
             22 BINARY_MULTIPLY
             24 STORE_FAST               2 (t)
 ...

Now, the multiplication isn't precalculated, the references in the resulting string will be of different objects.

The optimizer is yet not smart enough to recognize, that t is actually "a" in t=t*2, otherwise it would be able to perform the constant folding, but for now the resulting byte-code for your first version (I call it a2()):

... 5 22 LOAD_CONST 3 (2) 24 LOAD_FAST 2 (t) 26 BINARY_MULTIPLY 28 STORE_FAST 2 (t) ...

and it will return a list with 10^7 different objects (but all object being equal) inside:

>>> len({id(s) for s in a1()})
10000000

i.e. you will need about 56 bytes per string (sys.getsizeof returns 51, but because the pymalloc-memory-allocator is 8-byte aligned, 5 bytes will be wasted) + 8 bytes per reference (assuming 64bit-CPython-version), thus about 610Mb (+overalocation of the list + memory needed for the interpreter).

You can enforce the interning of the string via sys.intern:

import sys
def a1_interned():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        # here ensure, that the string-object gets interned
        # returned value is the interned version
        t = sys.intern(t) 
        lst.append(t)
    return lst

And realy, we can now not only see, that less memory is needed, but also that the list has references to the same object (see it online for a slightly smaller size(10^5) here):

>>> len({id(s) for s in a1_interned()})
1
>>> all((s=="aa" for s in a1_interned())
True

String interning can save a lot of memory, but it is sometimes tricky to understand, whether/why a string gets interned or not. Calling sys.intern explicitly eliminates this uncertainty.

The existence of additional temporary objects referenced by t is not the problem: CPython uses reference counting for memory managment, so an object gets deleted as soon as there is no references to it - without any interaction from the garbage collector, which in CPython is only used to break-up cycles (which is different to for example Java's GC, as Java doesn't use reference counting). Thus, temporary variables are really temporaries - those objects cannot be accumulated to make any impact on memory usage.

The problem with the temporary variable t is only that it prevents peephole optimization during the compilation, which is performed for "a"*2 but not for t*2.

It’s not really “interning”; just the storing of *constants* with the code object. String interning is a different process, where even strings used in different code blocks re-use the same string object on the heap. See [What does python sys.intern do, and when should it be used?](https://stackoverflow.com/q/1136826). — Martijn Pieters, Sep 07 '19 at 12:12
@MartijnPieters Thanks, you are right. Because all string constants are interned in `PyCode_New` (constant folding performed = interned) I have jumped to the wrong conclusion. Clearly "constant folding performed" is the key here, as it would explain also results for types which have no interning (as for example `float`, e.g. `t=3.14*2`). — ead, Sep 07 '19 at 19:31
So in the case `t=3.14` `t=t*2`(and many other situations), if I want to save memory, if there is a way like str.intern()? — Sinh, Sep 08 '19 at 04:41

score 3 · Answer 2 · answered Sep 06 '19 at 11:43

This difference is exist because of string interning in Python interpreter:

String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.

Let me show it in a simple example:

>>> t1 = 'a'
>>> t2 = t1 * 2
>>> t2 is 'aa'
False

>>> t1 = 'a'
>>> t2 = 'a'*2
>>> t2 is 'aa'
True

When you use the first variant, the Python string interning is not used so the interpreter creates additional internal variables to store temporal data. It can't optimize many-lines-code this way.

I am not a Python guru, but I think the interpreter works this way:

t = "a"
t = t * 2

In the first line it creates an object for t. In the second line it creates a temporary object for t right of the = sign and writes the result in the third place in the memory (with GC called later). So the second variant should use at least 3 times less memory than the first.

P.S. You can read more about string interning here.

Why tempory variable calculating in python for-loop takes so much memory usage?

2 Answers2

Linked