35

I have a program where I basically adjust the probability of certain things happening based on what is already known. My file of data is already saved as a pickled dictionary object in Dictionary.txt.

The problem is that every time I run the program it pulls in Dictionary.txt, turns it into a dictionary object, makes its edits, and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. I get the MemoryError when I dump; everything seems fine when I pull it in.
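
For reference, a minimal sketch of the cycle described above (names are illustrative, not from the actual program):

import pickle

# Load the whole pickled dictionary (this part works fine)
with open('Dictionary.txt', 'rb') as f:
    probabilities = pickle.load(f)

# ... adjust the probabilities based on what is already known ...
probabilities['some_event'] = 0.25

# Overwrite the entire file; this is where the MemoryError occurs
with open('Dictionary.txt', 'wb') as f:
    pickle.dump(probabilities, f)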

  • Is there a better (more efficient) way of doing the edits? (Perhaps without having to overwrite the entire file every time?)

  • Is there a way that I can invoke garbage collection (through the gc module)? (I already have it auto-enabled via gc.enable().)

  • I know that besides readlines() you can read line by line. Is there a way to edit the dictionary incrementally, line by line, when I already have a fully completed dictionary object file in the program?

  • Any other solutions?

Thank you for your time.

user2543682
  • There are a few compression and serialization libraries. Personally, I like dill and H5Py for large objects. If you are using scikit-learn and have a model based on a dictionary, perhaps you could use joblib as well (only really for those models). – Andrew Scott Evans Nov 05 '15 at 22:44

10 Answers

21

I was having the same issue. I used joblib and the work was done. Posting in case someone wants to know about other possibilities.

Save the model to disk:

from sklearn.externals import joblib
filename = 'finalized_model.sav'
joblib.dump(model, filename)  

Some time later... load the model from disk:

loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test) 

print(result)
Ch HaXam
    Note: use `import joblib` directly in later versions of sklearn. `DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib.` – IceTea Dec 15 '19 at 02:54
  • I tried your method but I got this error I can't solve: `TypeError: can't pickle _thread._local objects`. I had exactly the same one with my previous method, which was: `filehandler = open(file_name, 'wb'); pickle.dump(model, filehandler)`. Do you have any idea why? – LucieDevGirl Apr 07 '22 at 18:28
18

I am the author of a package called klepto (and also the author of dill). klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.

>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>> 

klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and it works no differently from a different process.

>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache  
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f       
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)

You can also pick from various levels of file compression, and whether you want the files to be memory-mapped. There are a lot of different options, both for file backends and database backends. The interface is identical, however.

With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.
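
For example, a minimal sketch of editing a single entry in place, building on the session above (only the entry being edited is held in memory; you can also call gc.collect() afterward, as noted in the comments below):

>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> # load only the entry to be edited into the memory cache
>>> d.load('big1')
>>> d['big1'] = d['big1'] * 2
>>> # write the cached entries back to the archive, then empty the cache
>>> d.dump()
>>> d.clear()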

See a longer tutorial here: https://github.com/mmckerns/tlkklp

Get klepto here: https://github.com/uqfoundation

Mike McKerns
    For future readers: you can call `d.dump(); d.clear(); gc.collect()` between each assignment of a numpy array to `d`. This ensures only one numpy array is in memory at a time, useful if the arrays are just big enough to fit in memory (like mine was). – Abhishek Divekar Jul 16 '17 at 07:02
  • What happens if you have an existing dictionary which you want to save to disk? For example, if I perform `for key in mydict: d[key] = mydict[key]`, would it copy the data or just hold a reference to it? – Abhishek Divekar Jul 16 '17 at 07:04
  • If you have an existing dict, you can pass it to a new `klepto` archive with the `dict` keyword. I believe it should not make a copy. – Mike McKerns Jul 17 '17 at 20:07
  • do you mind adding some example code to your answer? I believe it would be of benefit for future readers. – Abhishek Divekar Jul 17 '17 at 20:20
  • @abhidivekar: just replace the `{}` with a non-empty dict in the code above. That should do it. – Mike McKerns Jul 18 '17 at 16:38
  • This is nice, but there is not much documentation. How do I load a set iteratively without loading everything into memory? I want to use one file but load things progressively. – Dreaded semicolon Jan 29 '20 at 05:05
  • @MikeMcKerns I would like to dump a list of arrays. I couldn't achieve it with `pickle` or `dill`. Also, `hickle` was left overnight but seems to be dumping continuously without stopping. Not sure if `klepto` is appropriate for a list of arrays or dictionaries only. https://stackoverflow.com/questions/60577147/python-object-serialization-having-issue-with-pickle-vs-hickle – arilwan Mar 07 '20 at 13:28
  • @arilwan: Not sure why you couldn't dump a list of arrays with `dill` unless you are hitting some memory limitation. If you want to dump a huge list of arrays, you might want to look at `dask` or `klepto`. `dask` could break up the list into lists of sub-arrays, while `klepto` could break up the list into a dict of sub-arrays (with the key indicating the ordering of the sub-arrays) – Mike McKerns Mar 07 '20 at 16:52
  • Thank you for your response. But I tried both `dill` and `klepto` and for all I get `MemoryError`. https://stackoverflow.com/questions/60577147/python-object-serialization-having-issue-with-pickle-vs-hickle – arilwan Mar 07 '20 at 17:04
5

None of the above answers worked for me. I ended up using Hickle, which is a drop-in replacement for pickle based on HDF5. Instead of saving the data to a pickle file, it saves it to an HDF5 file. The API is identical for most use cases and it has some really cool features, such as compression.

pip install hickle

Example:

import hickle as hkl
import numpy as np

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Load data
array_hkl = hkl.load('test.hkl')
gidim
4

I had a memory error and resolved it by using protocol=2:

cPickle.dump(obj, file, protocol=2)
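
For anyone on Python 3 (where cPickle has been folded into pickle), a rough equivalent using the highest available protocol would be:

import pickle

with open('Dictionary.txt', 'wb') as f:
    pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)  # obj is your dictionary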

denfromufa
2

If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs:

import anydbm

# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')

# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'

# Loop through contents.  Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v

# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4

# Close when done.
db.close()
Imran
  • Also in the standard library is the [shelve](https://docs.python.org/2/library/shelve.html#module-shelve) module, which lets us open a "shelf" dict-like object, which uses `anydbm` underneath to store arbitrary pickleable objects as values (the keys are still strings). So, pickling and unpickling happens at the granularity of values. By default, a shelf persists values to disk whenever we assign to one of its keys. – Evgeni Sergeev Dec 11 '17 at 04:07
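
A minimal sketch of the shelve approach mentioned in the comment above (the filename and values are illustrative):

import shelve

# Open (or create) the shelf; values can be arbitrary picklable objects
db = shelve.open('cache')

# Each assignment pickles only that one value, not the whole dictionary
db['www.python.org'] = {'title': 'Python Website', 'weight': 0.9}

print(db['www.python.org'])
db.close()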
2

Have you tried using streaming pickle? https://code.google.com/p/streaming-pickle/

I have just solved a similar memory error by switching to streaming pickle.

Chris Wheadon
  • streaming-pickle does not appear well-suited to the present case of just one single dictionary object. When it is better suited (like for a huge list), normal pickle can do the trick as well, because you can dump several pickles into the same file one after the other, see question ["Saving and loading multiple objects in python pickle file"](http://stackoverflow.com/questions/20716812/saving-and-loading-multiple-objects-in-python-pickle-file/28745948) – Lutz Prechelt Feb 27 '15 at 13:04
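
A minimal sketch of the several-pickles-per-file idea from the comment above (chunking the dictionary is an assumption for illustration, not part of the linked answer):

import pickle

big_dict = {i: i * 0.5 for i in range(1000000)}  # stand-in for the real dictionary
items = list(big_dict.items())
chunk_size = 100000

# Dump the dictionary chunk by chunk, one pickle after another, into one file
with open('Dictionary.txt', 'wb') as f:
    for start in range(0, len(items), chunk_size):
        pickle.dump(dict(items[start:start + chunk_size]), f, protocol=2)

# Read the chunks back and merge them
merged = {}
with open('Dictionary.txt', 'rb') as f:
    while True:
        try:
            merged.update(pickle.load(f))
        except EOFError:
            break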
2

How about this?

import cPickle as pickle
p = pickle.Pickler(open("temp.p","wb")) 
p.fast = True 
p.dump(d) # d could be your dictionary or any other picklable object
richie
  • I tried this, and it pickled successfully, but I get a `ValueError: insecure string pickle` error when I try to unpickle it with `pickle.load(open("temp.p", "rb"))`. I read that this could be because the pickle file was never closed, but the Pickler instance has no attribute 'close'. Could you help me find out how to unpickle the file again? – ROIMaison Oct 08 '14 at 06:27
  • could it be that the file object is never closed? – Andrew Scott Evans Jul 23 '15 at 01:34
2

I recently had this problem. After trying cPickle with both ASCII and the binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, was not pickling due to a memory error. However, the dill package seemed to resolve the issue. dill will not bring many improvements for a dictionary but may help with streaming, as it is meant to stream pickled bytes across a network.

import dill

# Save the object
with open(path, 'wb') as fp:
    dill.dump(obj, fp)

# Load it back later
with open(path, 'rb') as fp:
    obj = dill.load(fp)

If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be the issue. At 123 MB, Pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support. My poly-kernel SVM saved successfully.
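
As a sketch of the database route using only the standard library (sqlite3; the table layout and names are illustrative), individual entries can be updated in place without rewriting the whole 123 MB file:

import pickle
import sqlite3

conn = sqlite3.connect('probabilities.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)')

# Insert or update a single entry; the pickled blob can hold any picklable value
payload = sqlite3.Binary(pickle.dumps(0.25, protocol=2))
conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
             ('some_event', payload))
conn.commit()

# Read one entry back without loading anything else
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('some_event',)).fetchone()
value = pickle.loads(bytes(row[0]))
conn.close()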

Andrew Scott Evans
  • I tried this; when about 250 MB had been dumped, the script was interrupted with a `MemoryError` message: https://stackoverflow.com/questions/60577147/python-object-serialization-having-issue-with-pickle-vs-hickle – arilwan Mar 07 '20 at 13:29
  • This seems like a reliable package which extends Python's `pickle` module for serializing and de-serializing Python objects of the majority of the built-in Python types. I tried it for saving and loading a pandas DataFrame, a Python dictionary, and a TfidfVectorizer from sklearn! – Farid Alijani May 03 '23 at 15:47
1

This may seem trivial, but try using 64-bit Python if you are not already.
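
A quick way to check the interpreter's bitness (both checks use only the standard library):

import struct
import sys

print(struct.calcsize('P') * 8)   # prints 32 or 64
print(sys.maxsize > 2**32)        # True on a 64-bit build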

lyron
0

I tried the following solutions, but none of them resolved my problem:

  1. Using hickle to replace pickle
  2. Using joblib to replace pickle
  3. Using sklearn.externals.joblib to replace pickle
  4. Changing the pickle mode (protocol)

Here is a different approach to this issue:

Finally, I found that the root cause was that the working directory path was too long, so I changed to a directory with a very short path.

Enjoy it.

W. Dan