When I pickle a dictionary of DataFrames and then unpickle it again, I experience a kind of memory leak: after the unpickled variable is dereferenced, the memory is only partially released, and calling gc.collect() does not help. I have created the following minimal example:
import pickle
import numpy as np
import pandas as pd
new = np.zeros((1000, 100))
new = pd.DataFrame(new)
cc = {ix: new.copy() for ix in range(500)}
with open('/tmp/test21', 'wb') as f:
    pickle.dump(cc, f)
Now I open a clean python session and do
import pickle
# memory consumption is around 40MB
data = pickle.load(open('/tmp/test21', 'rb'))
# memory consumption goes to 991MB
data = None
# memory consumption goes to 776MB
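To check where the retained memory lives, one thing I can do is count gc-tracked objects before and after dropping the reference: if the count drops sharply while RSS stays high, the DataFrames are genuinely freed at the Python level and the residue must be held lower down. A self-contained sketch (with smaller sizes than the example above, and a hypothetical temp-file name, just to keep it light):

```python
import gc
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Recreate a smaller version of the pickle so the snippet is self-contained.
path = os.path.join(tempfile.gettempdir(), 'test21_small')  # hypothetical name
df = pd.DataFrame(np.zeros((100, 10)))
cc = {ix: df.copy() for ix in range(50)}
with open(path, 'wb') as f:
    pickle.dump(cc, f)
del cc

with open(path, 'rb') as f:
    data = pickle.load(f)
gc.collect()
loaded = len(gc.get_objects())   # tracked objects while the dict is alive

data = None
gc.collect()
dropped = len(gc.get_objects())  # tracked objects after dropping the reference

# If this is large, the DataFrames were freed at the Python level, so the
# residual RSS is held below Python (e.g. by the C allocator).
python_objects_freed = loaded - dropped
```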
This is pandas 0.19.2 and Python 2.7.13. The problem seems to be the interaction between pickle, the dictionary and pandas: if I remove the line new = pd.DataFrame(new), the problem does not occur; if I simply make one large DataFrame without a dictionary, the problem does not occur; and if I don't pickle the dictionary but just set cc = None, the problem does not occur. I have also reproduced the problem with pandas 0.14.1 and Python 2.7.13, and it appears with both pickle and cPickle.
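One strategy to analyze this further might be to check whether the "leaked" memory is merely being retained by the C allocator rather than by Python. A sketch, assuming Linux with glibc (malloc_trim is glibc-specific and asks the allocator to return free heap pages to the OS):

```python
import ctypes
import ctypes.util

libc_name = ctypes.util.find_library('c')
libc = ctypes.CDLL(libc_name) if libc_name else None

if libc is not None and hasattr(libc, 'malloc_trim'):
    # Returns 1 if memory was actually released back to the system, else 0.
    trim_result = libc.malloc_trim(0)
else:
    trim_result = None  # not glibc; this check does not apply
```

If RSS drops after calling malloc_trim(0), the memory was never leaked in the Python sense; glibc had simply kept the freed pages in its arenas instead of returning them to the OS.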
What could be the reason or a strategy to analyze this further? Any help is much appreciated!