
I have a Python 3.x program that processes several large text files, each containing sizeable arrays of data that can occasionally brush up against the memory limit of my puny workstation. From some basic memory profiling, it seems that when iterating over the generator, my script's memory usage balloons to hold consecutive elements, using up to twice the memory I expect.

I made a simple, stand-alone example to test the generator, and I get similar results in Python 2.7, 3.3, and 3.4. My test code follows; `memory_usage()` is a modified version of this function from an SO question, which reads `/proc/self/status` and agrees with `top` as I watch it. `resource` is probably a more cross-platform method:

import sys, resource, gc, time

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

def consumer():
    for data in biggen():
        rusage = resource.getrusage(resource.RUSAGE_SELF)
        peak_mb = rusage.ru_maxrss/1024.0
        print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
                peak_mb, len(data)/1e6))
        #print(memory_usage())  # alternative check that reads /proc/self/status

        data = None  # go
        del data     # away
        gc.collect() # please.

# def memory_usage():
#     """Memory usage of the current process, requires /proc/self/status"""
#     # https://stackoverflow.com/a/898406/194586
#     result = {'peak': 0, 'rss': 0}
#     for line in open('/proc/self/status'):
#         parts = line.split()
#         key = parts[0][2:-1].lower()
#         if key in result:
#             result[key] = int(parts[1])/1024.0
#     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

print(sys.version)
consumer()

In practice I'll process data coming from such a generator loop, saving just what I need and then discarding it.
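
Roughly, the consumption pattern looks like this, reusing `biggen()` from the test script above (`crunch()` is just a stand-in for the real processing, which I haven't shown):

def crunch(data):
    # Stand-in for the real processing: reduce a big chunk to something small.
    return sum(data)

results = []
for data in biggen():
    results.append(crunch(data))  # keep only the small, reduced result
    del data                      # ...and discard the big chunk itself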

When I run the above script and two large elements come in series (the data size can be highly variable), it seems like Python computes the next element before freeing the previous one, leading to up to double the memory usage.

$ python genmem.py 
2.7.3 (default, Sep 26 2013, 20:08:41) 
[GCC 4.6.3]
Peak:    7.9 MB, Data Len:    1.0 M
Peak:   11.5 MB, Data Len:    1.0 M
Peak:   45.8 MB, Data Len:   10.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:    1.0 M
Peak:   45.9 MB, Data Len:   10.0 M
#        ^^  not much different versus previous 10M-list
Peak:   80.2 MB, Data Len:   10.0 M
#        ^^  same list size, but new memory peak at roughly twice the usage
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:    1.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:   80.2 MB, Data Len:   10.0 M
Peak:  118.3 MB, Data Len:   20.0 M
#        ^^  and again...  (20+10)*x
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:    1.0 M
Peak:  118.3 MB, Data Len:   20.0 M
Peak:  156.5 MB, Data Len:   20.0 M
#        ^^  and again. (20+20)*x
Peak:  156.5 MB, Data Len:    1.0 M
Peak:  156.5 MB, Data Len:    1.0 M

The crazy belt-and-suspenders-and-duct-tape approach of `data = None`, `del data`, and `gc.collect()` does nothing.

I'm pretty sure the generator itself is not doubling up on memory, because otherwise the peak would jump in the same iteration that a single large yielded value appeared; instead, the peak only jumps when large objects arrive consecutively.

How can I save my memory?

Nick T
  • `id_ = None` is useless as _id is referenced by ids. – njzk2 Feb 14 '14 at 18:43
  • if you only care about the first element of `data`, you should refactor `plate.good_data` to give you a generator, then just grab the first element yielded from it, no? It seems like all your problems are coming from loading giant pieces of `data` into memory, the vast majority of which you don't care about. – roippi Feb 14 '14 at 18:56
  • what about using temporary list for `i` and `data`? Something like: `for [i, data] in enumerate(plate.good_data())`. There is a chance the garbage collector does something here? – Raydel Miranda Feb 14 '14 at 19:12
  • @roippi in use, I will be passing `data` to another function that crunches it down to something more manageable. I was trying to make progressively more minimal 'programs' to narrow down what the problem was, and even without keeping any reference to it (that I can see), it still eats memory. – Nick T Feb 14 '14 at 19:41
  • @roippi I might be able to modify my generator to accept a function that does said processing before it returns the item... – Nick T Feb 14 '14 at 19:41
  • I must say, this is the first time I am hearing that generators increase memory usage. Usually it is the opposite. Also, if you think the generators are computing `next` and therefore increasing memory usage, then maybe you should not be using the second generator `enumerate` – smac89 Feb 14 '14 at 19:50
  • @Smac89 it certainly decreases usage over jamming everything into a list (for this thing that would mean about 30 GB of memory used), but it's maddening that it's using up to twice as much memory as any one object returned from the generator at a time. – Nick T Feb 14 '14 at 21:14
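
A rough sketch of the idea Nick T floats in the comments above, i.e. having the generator accept a processing function so the full-size chunk never leaves the generator's frame (the `process` argument and the use of `sum` here are only illustrative):

def biggen(process):
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        # The big list exists only as a temporary inside this call expression,
        # so it is freed as soon as process() returns.
        yield process([1] * int(size * 1e6))

for result in biggen(sum):  # e.g. reduce each chunk to its sum
    print(result)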

3 Answers


The problem is in the generator function, specifically in this statement:

    data = [1] * int(size * 1e6)

Suppose you have old content in the `data` variable. When you run this statement, the right-hand side is computed first, so you have two of these arrays in memory at once: the old and the new. Only then is the `data` variable rebound to point to the new structure, and the old one is released. Try modifying the generator function to:

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = None  # drop the reference to the previous chunk before building the next
        data = [1] * int(size * 1e6)
        yield data
ondra

Have you tried using the `gc` module? It lets you get a list of the objects that still reference your large data between loops, check whether it's in the list of unreachable but unfreed objects, and enable some debug flags.

With luck, a simple call to gc.collect() after each loop may fix your problem in a single line.
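
If that doesn't do it, here is a sketch of the kind of inspection I mean (the `big_list` name is just a placeholder; `gc.set_debug`, `gc.get_referrers`, `gc.collect`, and `gc.garbage` are all standard `gc` APIs):

import gc

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)  # report objects the collector cannot free

big_list = [1] * int(1e6)

# See what still holds a reference to the big list (here, just the module globals).
for referrer in gc.get_referrers(big_list):
    print(type(referrer))

gc.collect()
print(gc.garbage)  # uncollectable objects found so far; normally empty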

BoppreH
  • Good thinking, but I get the same result with a `collect()` (I forgot to mention, but I updated my post). I'll look at having it tell me what might be keeping a ref though. – Nick T Feb 14 '14 at 19:46
  • I updated the question to provide an example anyone could run. – Nick T Feb 14 '14 at 22:12

Instead of:

        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

Try:

        yield [1] * int(size * 1e6)

The problem is simply that the generator's `data` local variable keeps a reference to the yielded list, preventing it from being garbage collected until the generator resumes and discards that reference.

In other words, doing `del data` outside the generator has no effect on garbage collection unless that's the only reference to the data. Avoiding a reference inside the generator makes that true.
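
Put together with the `del` in the consumer loop, a minimal sketch of the whole adjusted pattern (the `print` stands in for real processing):

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        # Yield the list directly: no generator-local name keeps it alive
        # while the consumer is working on it.
        yield [1] * int(size * 1e6)

def consumer():
    for data in biggen():
        print(len(data))  # real processing goes here
        del data          # drop the consumer's reference before the generator resumes

consumer()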

Addendum

If you have to manipulate the data first, you can use a hack like this to drop the reference before yielding it:

        data = [1] * int(size * 1e6)
        # ... do stuff with data ...

        # Yield data without keeping a reference to it:
        hack = [data]
        del data
        yield hack.pop()
Pi Delport
  • I see it now between yours and ondra's post. With my simplifications it's easy to make the fix seem trivial, but my actual code makes it a bit more difficult as I do other manipulations with the data and have sequential generators. – Nick T Feb 15 '14 at 17:10
  • @NickT: I added an example of a quick hack to let you work with a local variable, but still drop the reference when yielding. Beyond that, you'll probably have to show your actual code for further advice. – Pi Delport Feb 16 '14 at 01:28