Problem
I noticed that memory allocated while iterating through a Pandas GroupBy object is not deallocated after iteration. I use `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` (see the second answer in this post for details) to measure the memory used by the Python process. Note that `ru_maxrss` is the peak resident set size — a high-water mark — so it only grows when the process actually touches more memory than it ever has before.
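For convenience, the reading can be wrapped in a small helper (my sketch, not from the post; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource

def peak_rss_gb():
    """Peak resident set size of this process, in GB
    (assuming Linux, where ru_maxrss is in kilobytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6

print(peak_rss_gb())
```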
```python
import resource
import gc

import numpy as np
import pandas as pd

i = np.random.choice(list(range(100)), 4000)
cols = list(range(int(2e4)))
df = pd.DataFrame(1, index=i, columns=cols)
gb = df.groupby(level=0)
# gb = list(gb)

for i in range(3):
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
        # del idx, x
        # gc.collect()
```
This prints the following peak memory usage (in GB):

```
0.671732
1.297424
1.297952
1.923288
1.923288
2.548624
```
Solutions
Uncommenting `del idx, x` and `gc.collect()` fixes the problem. However, I have to `del` every variable that references a DataFrame returned by iterating over the groupby (which can be a pain, depending on the code in the inner loop). The new printed memory usages become:

```
0.671768
1.297412
1.297992
1.297992
1.297992
1.297992
```
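A variant of this fix (my sketch, not from the measurements above) is to scope the per-group work inside a function: references created there disappear when the function returns, so nothing keeps each group's DataFrame alive, and a comprehension keeps even the loop variables out of the enclosing scope.

```python
import gc

import numpy as np
import pandas as pd

df = pd.DataFrame(1, index=np.random.choice(range(100), 4000), columns=range(200))
gb = df.groupby(level=0)

def process(group):
    # References created here die when the function returns,
    # so they cannot keep the group's DataFrame alive.
    return group.sum().sum()

# In Python 3, comprehension variables do not leak into this scope,
# so no trailing `del` is needed for the loop variables.
totals = [process(x) for _, x in gb]
gc.collect()  # collect any remaining reference cycles promptly
print(len(totals))
```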
Alternatively, I can uncomment `gb = list(gb)`. The resulting memory usages are roughly the same as with the previous solution:

```
1.32874
1.32874
1.32874
1.32874
1.32874
1.32874
```
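To illustrate what this change does: `list(gb)` materializes every `(key, sub-DataFrame)` pair once, up front, instead of constructing the sub-DataFrames lazily on each pass (a small sketch with a reduced frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(1, index=np.random.choice(range(100), 4000), columns=range(100))
gb = df.groupby(level=0)

groups = list(gb)    # each element is a (key, sub-DataFrame) tuple
key, sub = groups[0]
print(len(groups), type(sub).__name__)
```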
Questions
- Why is the memory for the DataFrames produced by iterating over the groupby not deallocated once iteration completes?
- Is there a better solution than the two above? If not, which of the two is preferable?