
I have a Python script that processes a large amount of data from compressed ASCII. After a short period, it runs out of memory. I am not constructing large lists or dicts. The following code illustrates the issue:

import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc

process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()

print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
    gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))

It prints:

Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB

Between the first and second iterations, 6 MB of memory is allocated. If I run the loop more than two times, the memory usage stays at 51 MB. If I put the decompression code into its own function and feed it the actual compressed data, the memory usage will continue to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.

user2133814
  • I wouldn't say that is a memory leak; it is normal memory consumption. – Daniel Dec 01 '14 at 20:47
  • Besides looking quite normal, as @Daniel said, what about `byte_array` and `a = np.array`? Your first iteration outputs the memory usage *before* instantiating them. That sounds like a lot of data, which is likely not to be destroyed by the garbage collector because you call it within the `for` loop scope. Unindent (move left) that `gc.collect()` so it runs outside the `for` loop, and see what happens. – Savir Dec 01 '14 at 20:49
  • @BorrajaX added another `gc.collect()` before the last print and after the loop exits; no change. For all the print statements the `byte_array` and `a` variables shouldn't exist in memory – user2133814 Dec 01 '14 at 20:55
  • sorry, sorry... Even after the `for` loop, `byte_array` and `a` are in your scope (my bad, they don't get destroyed). Right after the loop ends (and before your second `gc.collect()` that you just added) do `byte_array = None` `a=None`... Now I'm curious myself **:-)** – Savir Dec 01 '14 at 20:57
  • @BorrajaX added in those set-to-None statements and it cleared the memory, fixing the concern I had. I misunderstood Python scoping; I'm more used to Java. Anyway, I still have an issue in my code, but the above example doesn't correctly show it. Thanks – user2133814 Dec 01 '14 at 21:06
  • I'm gonna add this as an answer so you can choose it and give me juicy reputation **:-D** (if you want, if you waaaAAAAaant ) But yeah, it made me curious, so I did investigate a bit... – Savir Dec 01 '14 at 21:08
  • @BorrajaX I'll give you even more rep if you can help me with http://stackoverflow.com/questions/27251451/python-memory-leak-with-struct-and-numpy – user2133814 Dec 02 '14 at 14:26
  • Looks like you figured out yourself. Nice!! **:-)** – Savir Dec 02 '14 at 14:45

1 Answer


Through comments, we figured out what was going on:

The main issue is that variables declared in a `for` loop are not destroyed once the loop ends. They remain accessible, still bound to the value they received in the last iteration:

>>> for i in range(5):
...     a=i
...
>>> print a
4
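By contrast, names bound inside a function are local to it and disappear when the function returns, so the objects they referenced can be reclaimed (a quick sketch; `make_data` is just an illustrative name, not from the question's code):

```python
def make_data():
    local_list = list(range(100000))  # bound only inside the function
    return len(local_list)

make_data()
# Once make_data() returns, `local_list` no longer exists anywhere,
# so the list it referred to can be garbage collected.
print('local_list' in globals())
```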

So here's what's happening:

  • First iteration: the print shows 45 MB, which is the memory before instantiating `byte_array` and `a`.
  • The code instantiates those two large variables, taking the memory to 51 MB.
  • Second iteration: the two variables instantiated in the first pass of the loop are still there.
  • In the middle of the second iteration, `byte_array` and `a` are rebound to new objects. The originals are destroyed, but replaced by equally large ones.
  • The `for` loop ends, but `byte_array` and `a` are still accessible in the code and are therefore not destroyed by the second `gc.collect()` call.

Changing the code to:

for i in xrange(2):
   [ . . . ]
byte_array = None
a = None
gc.collect()

makes the memory reserved by `byte_array` and `a` unreachable, and therefore eligible to be freed.
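Rebinding to `None` is one option; `del` removes the names entirely, which achieves the same thing. A minimal sketch using stand-in data instead of the original zlib pipeline:

```python
for i in range(2):
    byte_array = b'x' * 1000000   # stand-in for the decompressed bytes
    a = list(byte_array)          # stand-in for the unpacked array

del byte_array, a  # unbind both names; the objects become unreachable
print('byte_array' in globals(), 'a' in globals())
```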

There's more on Python's garbage collection in this SO answer: https://stackoverflow.com/a/4484312/289011

Also, it may be worth looking at How do I determine the size of an object in Python?. This is tricky, though: if your object is a list pointing to other objects, what is its size? The sum of the pointers in the list? The sum of the sizes of the objects those pointers point to?
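For example, `sys.getsizeof` reports only the container's own footprint, not the objects it refers to. A rough sketch (the one-level "deep" sum here is a naive approximation, not a full recursive measurement):

```python
import sys

nested = [list(range(1000)) for _ in range(10)]

# Shallow size: just the outer list object (header + pointer array).
shallow = sys.getsizeof(nested)

# Naive deep size: also count each inner list's own footprint
# (this still ignores the int objects one level further down).
deep = shallow + sum(sys.getsizeof(inner) for inner in nested)

print(shallow, deep)
```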

Savir