Problem
I noticed that memory allocated while iterating through a Pandas GroupBy object is not deallocated after iteration. I use `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` (see the second answer in this post for details) to measure the memory used by the Python process. Note that `ru_maxrss` is the peak resident set size — a high-water mark — so it only grows when the process actually touches more memory than it ever has before.
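For convenience, the reading can be wrapped in a small helper (my sketch, not from the post; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource

def peak_rss_gb():
    """Peak resident set size of this process, in GB
    (assuming Linux, where ru_maxrss is in kilobytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6

print(peak_rss_gb())
```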
```python
import resource
import gc

import numpy as np
import pandas as pd

i = np.random.choice(list(range(100)), 4000)
cols = list(range(int(2e4)))
df = pd.DataFrame(1, index=i, columns=cols)
gb = df.groupby(level=0)
# gb = list(gb)

for i in range(3):
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
        # del idx, x
        # gc.collect()
```
This prints the following peak memory usage (in GB):

```
0.671732
1.297424
1.297952
1.923288
1.923288
2.548624
```
Solutions
Uncommenting `del idx, x` and `gc.collect()` fixes the problem. However, I have to `del` every variable that references a DataFrame returned by iterating over the groupby (which can be a pain, depending on the code in the inner loop). The new printed memory usages become:

```
0.671768
1.297412
1.297992
1.297992
1.297992
1.297992
```
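A variant of this fix (my sketch, not from the measurements above) is to scope the per-group work inside a function: references created there disappear when the function returns, so nothing keeps each group's DataFrame alive, and a comprehension keeps even the loop variables out of the enclosing scope.

```python
import gc

import numpy as np
import pandas as pd

df = pd.DataFrame(1, index=np.random.choice(range(100), 4000), columns=range(200))
gb = df.groupby(level=0)

def process(group):
    # References created here die when the function returns,
    # so they cannot keep the group's DataFrame alive.
    return group.sum().sum()

# In Python 3, comprehension variables do not leak into this scope,
# so no trailing `del` is needed for the loop variables.
totals = [process(x) for _, x in gb]
gc.collect()  # collect any remaining reference cycles promptly
print(len(totals))
```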
Alternatively, I can uncomment `gb = list(gb)`. The resulting memory usages are roughly the same as with the previous solution:

```
1.32874
1.32874
1.32874
1.32874
1.32874
1.32874
```
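To illustrate what this change does: `list(gb)` materializes every `(key, sub-DataFrame)` pair once, up front, instead of constructing the sub-DataFrames lazily on each pass (a small sketch with a reduced frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(1, index=np.random.choice(range(100), 4000), columns=range(100))
gb = df.groupby(level=0)

groups = list(gb)    # each element is a (key, sub-DataFrame) tuple
key, sub = groups[0]
print(len(groups), type(sub).__name__)
```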
Questions
- Why is the memory for the DataFrames produced by iterating over the groupby not deallocated once iteration completes?
- Is there a better solution than the two above? If not, which of the two is preferable?