
When I pickle a dictionary of dataframes and then unpickle it again, I see what looks like a memory leak. After the unpickled variable is dereferenced, the memory is only partially released. Calling gc.collect() does not help. I have created the following minimal example:

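import pickle
import numpy as np
import pandas as pd
# 500 copies of a 1000 x 100 DataFrame of zeros, keyed by integer
new = np.zeros((1000, 100))
new = pd.DataFrame(new)
cc = {ix: new.copy() for ix in range(500)}
# write in binary mode and close the file handle when done
with open('/tmp/test21', 'wb') as f:
    pickle.dump(cc, f)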

Now I open a clean Python session and run:

import pickle
# memory consumption is around 40MB
data = pickle.load(open('/tmp/test21', 'rb'))
# memory consumption goes to 991MB
data = None
# memory consumption goes to 776MB

This is with pandas 0.19.2 and Python 2.7.13. The problem seems to lie in the interaction between pickle, the dictionary, and pandas. If I remove the line new = pd.DataFrame(new), the problem does not occur. If I simply make one large df without a dictionary, the problem does not occur. If I skip pickling and just set cc = None in the session that created the dictionary, the problem does not occur. I have also tested with pandas 0.14.1 and Python 2.7.13. Finally, the problem appears with both pickle and cPickle.

What could be the reason or a strategy to analyze this further? Any help is much appreciated!
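One way to make the numbers above reproducible is to log the resident set size at each step instead of reading it from an external monitor. The following is only a minimal sketch: `rss_mb` and `checkpoint` are hypothetical helper names, the `/proc` lookup assumes Linux, and the dictionary of `bytearray` objects is a stand-in that should be replaced by the actual `pickle.load` call.

```python
import gc
import os


def rss_mb():
    """Current resident set size in MB, or None if /proc is unavailable.

    Hypothetical helper; assumes a Linux-style /proc filesystem.
    """
    path = '/proc/self/status'
    if not os.path.exists(path):
        return None
    with open(path) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0  # value is reported in kB
    return None


def checkpoint(label, log):
    """Run a full garbage collection, then record (label, RSS) in log."""
    gc.collect()
    log.append((label, rss_mb()))


log = []
checkpoint('start', log)
# Stand-in allocation; replace with the pickle.load(...) call under test.
data = {i: bytearray(10 ** 6) for i in range(50)}
checkpoint('after load', log)
data = None
checkpoint('after dereference', log)

for label, mb in log:
    print('%-20s %s' % (label, mb))
```

Running the same script once with the dictionary of dataframes and once with a single large dataframe would give directly comparable readings for the two cases.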

bjonen
  • Do you have relatively large RAM? Perhaps memory is released back to the OS only when the free amount reaches some threshold, i.e. the runtime might be keeping "freelists" for future allocations. Can you try creating a large object and then `del` it to see how it affects things? For example, `b = np.random.rand(1000, 2000, 300)`. – bantmen Nov 04 '17 at 04:32
  • Might be relevant: https://stackoverflow.com/questions/15455048/releasing-memory-in-python – bantmen Nov 04 '17 at 04:34
  • `b = np.random.rand(1000, 2000, 300)` pushes memory consumption to 4.5GB. `b = None` releases it back to 53MB, exactly where I started from. – bjonen Nov 06 '17 at 09:33
  • Regarding https://stackoverflow.com/questions/15455048/releasing-memory-in-python: it seems that `gc.collect()` is able to free the memory almost entirely for the example presented there. In my case dereferencing brings it down to a bit more than half, but `gc.collect()` does not change consumption at all. – bjonen Nov 06 '17 at 10:00
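
The free-list hypothesis from the comments could be probed by asking the C allocator to hand unused heap pages back to the OS after dereferencing. The sketch below assumes Linux with glibc, where `malloc_trim` is available; `trim_malloc` is a hypothetical wrapper name, and whether the call actually lowers memory in the scenario above has not been verified here.

```python
import ctypes
import ctypes.util


def trim_malloc():
    """Ask glibc to return free heap memory to the OS.

    Returns True if malloc_trim reports that memory was released,
    False if it reports none was, and None if malloc_trim is not
    available (non-glibc platforms). Hypothetical helper.
    """
    name = ctypes.util.find_library('c')
    if name is None:
        return None
    try:
        libc = ctypes.CDLL(name)
        trim = libc.malloc_trim  # AttributeError if the symbol is missing
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return bool(trim(0))  # pad=0: release as much as possible


released = trim_malloc()
print('malloc_trim released memory: %s' % released)
```

If memory drops after this call but not after `gc.collect()`, that would suggest the objects were freed and the allocator was holding on to the pages, rather than a reference leak.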

0 Answers