
I have an object (defaultdict) with the structure: { string : [(string, (float, float)), (string, (float, float)), ....]}

Its size is about 12.5 MB.

I am pickling it with this code:

import pickle

with open(Path_to_file, 'wb') as file:
    pickle.dump(data_dict, file)

The pickle file weighs about 300 MB. While unpickling with this code:

with open(Path_to_file, 'rb') as file:
    data_dict_new = pickle.load(file)

the system uses a lot of RAM (about 3.5 GB or more). But after unpickling, Python still uses about 1 GB of RAM.

So I have two questions:

  1. What is kept in RAM apart from my structure?
  2. How can I free it?

gc.collect() doesn't help.

Ivan Savin
  • Sorry, it is my first question. So I have two questions: 1. What does keep in RAM apart of my structure? 2. How can I clean it? – Ivan Savin Feb 12 '16 at 11:05
  • When you pickle the object try using `pickle.HIGHEST_PROTOCOL` to select a more efficient binary [_Data Stream Format_](https://docs.python.org/2/library/pickle.html#data-stream-format) format. – martineau Feb 12 '16 at 11:51
  • Do you mean question 1: what is the overhead of pickle in RAM when reading or writing, EXCLUDING the size of "data_dict"; and question 2: is there some way to reduce the amount of RAM in use? I cannot replicate your observations in my tests of pickle, btw. I get a large file on disk (16x RAM) and a similar amount of RAM used before the pickle "dump" and after the pickle "load". – Vorsprung Feb 12 '16 at 11:51
  • btw, cPickle is very much faster than pickle – Vorsprung Feb 12 '16 at 12:06
  • Definitely not. AFTER loading the pickle file (after the with-block) I have only "data_dict". So, as I understand it, Python's memory use should equal the size of "data_dict", i.e. 12.5 MB. Why does it keep about 1 GB? – Ivan Savin Feb 12 '16 at 12:10
  • I am using ``/usr/bin/time -f 'RSS:%M KB' python scriptname`` to see RAM in use – Vorsprung Feb 12 '16 at 13:05
  • Might be useful, especially with regards to question 1.: https://stackoverflow.com/a/53941920/2734863 – Frikster May 27 '21 at 20:36
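martineau's suggestion in the comments can be sketched as follows (a minimal, self-contained example; the small dict and the temp-file path stand in for the real data_dict and Path_to_file from the question):

```python
import os
import pickle
import tempfile

# Small stand-in for the structure from the question:
# {string: [(string, (float, float)), ...]}
data_dict = {"key": [("name", (1.0, 2.0)), ("other", (3.5, 4.5))]}

path = os.path.join(tempfile.mkdtemp(), "test.pkl")

# HIGHEST_PROTOCOL selects the most compact binary stream format
# available, instead of the verbose text-based default of Python 2.
with open(path, "wb") as f:
    pickle.dump(data_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data_dict)  # the data round-trips unchanged
```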

1 Answer


I was able to reproduce this. Indeed, when unpickling a large (about 300 MB) file, a lot of extra memory stays in use. In my case, 1.6 GB was used by the process just to hold the originally generated data_dict, and 2.9 GB if I loaded it from the file.

However, if you run the unpickling in a subprocess, the system fully reclaims the memory after the process join() (as stated in this answer: https://stackoverflow.com/a/1316799/1102535). So here is an example of unpickling without the extra memory staying in use:

from multiprocessing import Process, Manager

def load_pickle(filename, data):
    # cPickle is the faster C implementation in Python 2
    import cPickle as pickle
    with open(filename, 'rb') as file:
        data_pkl = pickle.load(file)
    # copy into the manager dict so the parent process can see the result
    for key, val in data_pkl.iteritems():
        data[key] = val

if __name__ == '__main__':  # guard needed so the child process can safely import this module
    manager = Manager()
    data_dict = manager.dict()
    p = Process(target=load_pickle, args=("test.pkl", data_dict))
    p.start()
    p.join()
    print len(data_dict)

This code has its drawbacks (such as the copying between dicts), but at least it conveys the idea. For me, it uses almost the same amount of memory after unpickling as the original data did before pickling.

a5kin
  • Thank you very much for the idea, but it doesn't work for me. The data_dict is empty. I use Python 3; that may be the reason. – Ivan Savin Feb 13 '16 at 23:26
  • The function "load_pickle" isn't called. – Ivan Savin Feb 14 '16 at 10:09
  • Yes, this code was tested with Python 2.7. I just made a test with Python 3 also, and cannot reproduce memory issue at all. The amount of used RAM seems to be the same after unpickling as it was before, even without subprocess trick. If you can provide a full example of how you're testing, it would be very helpful. Btw, I measured process memory usage with ``top`` command, not ``time``. – a5kin Feb 15 '16 at 13:51
  • I use the Task Manager before the code starts and after. At first no RAM is used, and after the start 1 GB of memory is in use. I determined the size of my structure with the function `sys.getsizeof()`. – Ivan Savin Feb 15 '16 at 14:26
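A note on the 12.5 MB figure from the last comment: `sys.getsizeof()` is shallow. For a dict it counts only the hash table itself, not the lists, tuples, strings and floats it refers to, so it badly underestimates a nested structure like the one in the question. A minimal deep-size sketch (a hypothetical helper, not from the thread):

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Recursively sum the sizes of an object and everything it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

data = {"key": [("name", (1.0, 2.0)), ("other", (3.5, 4.5))]}
# The shallow size misses all the contained objects.
print(sys.getsizeof(data) < deep_getsizeof(data))  # True
```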