
I have a Python script that processes a large amount of data from compressed ASCII. After a short period, it runs out of memory. I am not constructing large lists or dicts. The following code illustrates the issue:

import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc

process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()

print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
    gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))

It prints:

Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB

Between the first and second iterations, 6 MB of memory is allocated. If I run the loop more than two times, the memory usage stays at 51 MB. If I put the decompression code into its own function and feed it the actual compressed data, the memory usage will continue to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.

user2133814
  • I wouldn't say that is a memory leak; it is normal memory consumption. – Daniel Dec 01 '14 at 20:47
  • Besides looking quite normal, as @Daniel said, what about `byte_array` and `a = np.array`? Your first iteration outputs the memory usage *before* instantiating them. That sounds like a lot of data, which is likely not to be destroyed by the garbage collector because you call it within the `for` loop scope. Unindent (move left) that `gc.collect()` so it runs outside the `for` loop, and see what happens. – Savir Dec 01 '14 at 20:49
  • @BorrajaX added another `gc.collect()` before the last print and after the loop exits; no change. For all the print statements the `byte_array` and `a` variables shouldn't exist in memory – user2133814 Dec 01 '14 at 20:55
  • sorry, sorry... Even after the `for` loop, `byte_array` and `a` are in your scope (my bad, they don't get destroyed). Right after the loop ends (and before your second `gc.collect()` that you just added) do `byte_array = None` `a=None`... Now I'm curious myself **:-)** – Savir Dec 01 '14 at 20:57
  • @BorrajaX added in those set-to-None statements and it cleared the memory, fixing the concern I had. I misunderstood Python scoping; I'm more used to Java. Anyway, I still have an issue in my code, but the above example doesn't correctly show it. Thanks – user2133814 Dec 01 '14 at 21:06
  • I'm gonna add this as an answer so you can choose it and give me juicy reputation **:-D** (if you want, if you waaaAAAAaant ) But yeah, it made me curious, so I did investigate a bit... – Savir Dec 01 '14 at 21:08
  • @BorrajaX I'll give you even more rep if you can help me with http://stackoverflow.com/questions/27251451/python-memory-leak-with-struct-and-numpy – user2133814 Dec 02 '14 at 14:26
  • Looks like you figured out yourself. Nice!! **:-)** – Savir Dec 02 '14 at 14:45

1 Answer


Through comments, we figured out what was going on:

The main issue is that variables declared in a `for` loop are not destroyed once the loop ends. They remain accessible, still bound to the value they received in the last iteration:

>>> for i in range(5):
...     a=i
...
>>> print a
4
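By contrast, names bound inside a function are local to it and disappear when the function returns, so the objects they referenced can be reclaimed (a quick sketch; `make_data` is just an illustrative name, not from the question's code):

```python
def make_data():
    local_list = list(range(100000))  # bound only inside the function
    return len(local_list)

make_data()
# Once make_data() returns, `local_list` no longer exists anywhere,
# so the list it referred to can be garbage collected.
print('local_list' in globals())
```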

So here's what's happening:

  • First iteration: the print shows 45 MB, which is the memory before instantiating `byte_array` and `a`.
  • The code instantiates those two large variables, taking the memory to 51 MB.
  • Second iteration: the two variables instantiated in the first pass of the loop are still there.
  • In the middle of the second iteration, `byte_array` and `a` are rebound to new objects. The originals are destroyed, but replaced by equally large ones.
  • The `for` loop ends, but `byte_array` and `a` are still accessible in the code and are therefore not destroyed by the second `gc.collect()` call.

Changing the code to:

for i in xrange(2):
   [ . . . ]
byte_array = None
a = None
gc.collect()

makes the memory reserved by `byte_array` and `a` unreachable, and therefore eligible to be freed.
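Rebinding to `None` is one option; `del` removes the names entirely, which achieves the same thing. A minimal sketch using stand-in data instead of the original zlib pipeline:

```python
for i in range(2):
    byte_array = b'x' * 1000000   # stand-in for the decompressed bytes
    a = list(byte_array)          # stand-in for the unpacked array

del byte_array, a  # unbind both names; the objects become unreachable
print('byte_array' in globals(), 'a' in globals())
```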

There's more on Python's garbage collection in this SO answer: https://stackoverflow.com/a/4484312/289011

Also, it may be worth looking at How do I determine the size of an object in Python?. This is tricky, though: if your object is a list pointing to other objects, what is its size? The sum of the pointers in the list? The sum of the sizes of the objects those pointers point to?
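For example, `sys.getsizeof` reports only the container's own footprint, not the objects it refers to. A rough sketch (the one-level "deep" sum here is a naive approximation, not a full recursive measurement):

```python
import sys

nested = [list(range(1000)) for _ in range(10)]

# Shallow size: just the outer list object (header + pointer array).
shallow = sys.getsizeof(nested)

# Naive deep size: also count each inner list's own footprint
# (this still ignores the int objects one level further down).
deep = shallow + sum(sys.getsizeof(inner) for inner in nested)

print(shallow, deep)
```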

Savir