14

I am processing some data and have stored the results in three dictionaries, which I have saved to disk with pickle. Each dictionary is 500-1000MB.

Now I am loading them with:

import pickle
with open('dict1.txt', "rb") as myFile:
    dict1 = pickle.load(myFile)

However, already while loading the first dictionary I get:

*** set a breakpoint in malloc_error_break to debug
python(3716,0xa08ed1d4) malloc: *** mach_vm_map(size=1048576) failed (error code=3)
*** error: can't allocate region securely
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1019, in load_empty_dictionary
    self.stack.append({})
MemoryError

How can I solve this? My computer has 16GB of RAM, so I find it unusual that loading an 800MB dictionary crashes. I also find it unusual that there were no problems while saving the dictionaries.

Further, in the future I plan to process more data, resulting in larger dictionaries (3-4GB on disk), so any advice on how to improve efficiency is appreciated.

flotr
  • What OS are you using? Is the size the *on disk* file size or did you measure actual memory use? – Martijn Pieters Jan 21 '15 at 13:55
  • It depends on your OS how much memory a process is allowed to allocate. – Martijn Pieters Jan 21 '15 at 13:55
  • Size is the file size on the disk. I am using Mac OS 10.10. Is there a way to adjust how much memory is allowed to be allocated? – flotr Jan 21 '15 at 13:57
  • 800MB of data doesn't translate to 800MB of memory usage; it could be larger or it could be smaller, but usually larger. How did you produce these pickles in the first place? – Martijn Pieters Jan 21 '15 at 14:03
  • I see... I have produced them with: `with open('dict1.txt', 'wb') as dict_items_save: pickle.dump(dict1, dict_items_save, protocol=2)` – flotr Jan 21 '15 at 14:05
  • And how large was your `dict1` then? You'd have to use [`sys.getsizeof()`](https://docs.python.org/2/library/sys.html#sys.getsizeof) recursively to get the memory footprint of that object. That footprint is dependent on the OS, and on whether you are using a 32-bit or 64-bit process. – Martijn Pieters Jan 21 '15 at 14:07
  • Thanks. I ended up rewriting my code to avoid large dictionaries. – flotr Jan 22 '15 at 16:36

3 Answers

13

If the data in your dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both klepto and joblib understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries across several files (instead of in a single file) or in a database.

See my answer to a very closely related question (https://stackoverflow.com/a/25244747/2379433) if you are OK with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with storage formats and backends to see which works best for your case. Also see https://stackoverflow.com/a/21948720/2379433 and https://stackoverflow.com/a/24471659/2379433 for other potential improvements.

As the links above discuss, you could use klepto, which gives you the ability to easily store dictionaries to disk or to a database using a common API. klepto also lets you pick a storage format (pickle, json, etc.); HDF5 or a SQL database is another good option, as it allows parallel access. klepto can use both specialized pickle formats (like numpy's) and compression (if you care about size rather than speed of access).

klepto gives you the option to store the dictionary in an "all-in-one" file or with one file per entry, and it can also leverage multiprocessing or multithreading, meaning you can save and load dictionary items to/from the backend in parallel. For examples, see the links above and the sketch below.
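As a rough illustration of the one-file-per-entry approach (this sketch is mine, not part of the original answer; the archive name 'dict1_store' and the keys are made up for the example):

from klepto.archives import dir_archive

# a directory-backed archive: each dictionary entry becomes its own file on disk
store = dir_archive('dict1_store', cached=True, serialized=True)

# fill it like a normal dict, then flush the in-memory cache to disk
store['results_a'] = {'mean': 1.2, 'count': 40000}
store['results_b'] = list(range(10000))
store.dump()

# later (or in another process): reopen the archive and load only what you need
store2 = dir_archive('dict1_store', cached=True, serialized=True)
store2.load('results_a')   # pulls just this entry back into the in-memory cache
print(store2['results_a'])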

Mike McKerns
  • Thanks for your answer. I have retained pickle, but I have radically modified my code to produce numpy arrays with a considerably smaller footprint. Now it works fine. – flotr Jan 22 '15 at 16:37
  • @Mike I am using `multiprocessing.Pool` with Pandas. After `apply_async`, if the pandas dataframe is a bit large, it throws a MemoryError. Can I use `klepto` to alleviate that? – MSS Jul 21 '20 at 09:34
  • @MSS: hard to tell from your question without more detail. Possibly. Klepto can push the data onto disk, and thus out of memory, and give you an interface to access portions of the data at a time. Depending on your use case, I expect either `klepto` or `dask` may help. – Mike McKerns Jul 21 '20 at 11:09
4

This is an inherent problem with pickle, which is intended for use with rather small amounts of data. The size of the dictionaries, when loaded into memory, is many times larger than on disk.

After loading a pickle file of 100MB, you may well have a dictionary of almost 1GB or so. There are some formulas on the web to calculate the overhead, but I can only recommend using a decent database like MySQL or PostgreSQL for such amounts of data.
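If you want a rough sense of the disk-to-memory ratio for your own data, you could compare the pickled byte size with a recursive sys.getsizeof() walk, as suggested in the comments under the question. A minimal sketch, where the deep_getsizeof helper is illustrative and only covers the common built-in containers:

import pickle
import sys

def deep_getsizeof(obj, seen=None):
    """Roughly sum sys.getsizeof() over an object and its contents."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(x, seen) for x in obj)
    return size

d = {i: str(i) * 10 for i in range(100000)}
on_disk = len(pickle.dumps(d, protocol=2))
in_memory = deep_getsizeof(d)
print("pickled: %.1f MB, in memory: %.1f MB" % (on_disk / 1e6, in_memory / 1e6))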

inixmon
  • Yeah... I knew that the sizes are not equal, but I didn't expect that the ratio could be 10x... – flotr Jan 22 '15 at 16:37
-3

I suppose you are using 32-bit Python, which is limited to about 4GB of address space. You should use 64-bit Python instead of 32-bit. I have tried it: my pickled dict was beyond 1.7GB, and I didn't have any problem except that loading took longer.
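To check whether your interpreter is a 32-bit or a 64-bit build (a quick sketch, not part of the original answer), either of these works:

import struct
import sys

# pointer size in bits: prints 32 on a 32-bit build, 64 on a 64-bit build
print(struct.calcsize("P") * 8)

# equivalent check: True only on a 64-bit build
print(sys.maxsize > 2**32)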

Jett
  • Could you clarify your answer more? How? A code snippet? – Farid Alijani Oct 10 '19 at 07:55
  • When I tried to use a 32-bit version of Python to load pickled dict data exceeding 1.7GB, it hung for a very long time. A 64-bit version of Python did not. – Jett Dec 01 '19 at 04:51