
I am trying to use _pickle to save data to disk, but when calling _pickle.dump I get the error

OverflowError: cannot serialize a bytes object larger than 4 GiB

Is this a hard limitation of _pickle (cPickle in Python 2)?
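For reference, a minimal reproduction (a sketch; the file name is a placeholder, and protocol 3 is pinned explicitly since newer Python 3 releases default to a higher protocol):

import pickle

big = b"\x00" * (2**32 + 1)  # just over 4 GiB

with open("data.pkl", "wb") as f:
    # raises OverflowError: cannot serialize a bytes object larger than 4 GiB
    pickle.dump(big, f, protocol=3)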

Martijn Pieters
Jake0x32

3 Answers


Not anymore as of Python 3.4, which implements PEP 3154 (pickle protocol 4):
https://www.python.org/dev/peps/pep-3154/

But you need to say you want to use version 4 of the protocol:
https://docs.python.org/3/library/pickle.html

pickle.dump(d, open("file", 'wb'), protocol=4)
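A fuller sketch (d and "file" are placeholders from the line above; the with block closes the file explicitly, a point discussed in the comments below):

import pickle

d = {"payload": b"\x00" * (2**32 + 1)}  # anything over 4 GiB needs protocol 4

with open("file", "wb") as f:        # pickle requires binary mode
    pickle.dump(d, f, protocol=4)    # protocol 4 supports objects > 4 GiB

with open("file", "rb") as f:
    d2 = pickle.load(f)              # the protocol is auto-detected on load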
Eric Levieil
  • is it a good way to open file this way? I mean without closing it. – 1a1a11a May 21 '18 at 18:32
  • @1a1a11a It would be good practice to open the file using a 'with' statement to ensure that the file gets closed. However, the reference count to the file object drops to zero as soon as the call to pickle.dump returns, so it will get garbage collected right away, and the file will be closed anyway. – jlund3 Nov 05 '18 at 00:53
  • @jlund3 Thanks for that. I already wondered what on earth the use of "with" is, if Python has a garbage collector. It's all about scoping, I guess. – wessel Oct 11 '19 at 12:21

Yes, this is a hard-coded limit; from the save_bytes function in CPython's _pickle.c:

else if (size <= 0xffffffffL) {
    /* ... */
}
else {
    PyErr_SetString(PyExc_OverflowError,
                    "cannot serialize a bytes object larger than 4 GiB");
    return -1;          /* string too large */
}

The protocol uses 4 bytes to write the size of the object to disk, which means you can only track sizes of up to 2³² == 4 GiB.
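You can see the length field directly with pickletools (an illustrative sketch; protocol 3 is pinned so the 4-byte BINBYTES opcode is used):

import pickle, pickletools

# bytes objects of 256 bytes or more are written with the BINBYTES opcode,
# whose length field is a 4-byte little-endian integer, hence the 4 GiB cap.
blob = pickle.dumps(b"x" * 300, protocol=3)
pickletools.dis(blob)
# Protocol 4 adds a BINBYTES8 opcode with an 8-byte length field instead.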

If you can break up the bytes object into multiple objects, each smaller than 4 GiB, you can still save the data to a pickle, of course.
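For illustration, a chunking sketch (dump_chunked and load_chunked are hypothetical helper names, and the 1 GiB chunk size is an arbitrary value under the limit):

import pickle

def dump_chunked(data, path, chunk_size=2**30):
    # split the bytes object into pieces that each fit in a 4-byte length field
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with open(path, "wb") as f:
        pickle.dump(chunks, f)

def load_chunked(path):
    with open(path, "rb") as f:
        return b"".join(pickle.load(f))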

Martijn Pieters
  • Thank you! is it possible to save large file on disk and circumvent this limit? – Jake0x32 Apr 17 '15 at 16:10
  • @Jake0x32: not with pickle; this is a hard limit in the protocol. Break up your `bytes` object into smaller pieces. – Martijn Pieters Apr 17 '15 at 16:11
  • 1
    @MartijnPieters I have the same problem while trying to pickle a classifier `from sklearn.svm import SVC`. How would I break the object into bytes and then pickle? – CIsForCookies Jan 03 '18 at 09:11

There are great answers above for why pickle doesn't work here. But none of that helps in Python 2.7, which is a problem if you are still on Python 2.7 and want to support large files, especially NumPy arrays (NumPy arrays over 4 GiB fail).

You can use OC serialization, which has been updated to work for data over 4 GiB. There is a Python C extension module available from:

http://www.picklingtools.com/Downloads

Take a look at the Documentation:

http://www.picklingtools.com/html/faq.html#python-c-extension-modules-new-as-of-picklingtools-1-6-0-and-1-3-3

But here's a quick summary: there are ocdumps and ocloads, very much like pickle's dumps and loads:

from pyocser import ocdumps, ocloads

ser = ocdumps(pyobject)    # serialize pyobject into the string ser
pyobject = ocloads(ser)    # deserialize the string ser back into pyobject
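For instance, a file round trip might look like this (a sketch assuming ocdumps returns a plain byte string, as the summary above implies; payload.oc is a hypothetical file name):

from pyocser import ocdumps, ocloads

big = b"\x00" * (5 * 1024**3)  # a 5 GiB payload, beyond pickle protocol 2's limit

with open("payload.oc", "wb") as f:
    f.write(ocdumps(big))           # serialize and write in one pass

with open("payload.oc", "rb") as f:
    restored = ocloads(f.read())    # read back and deserialize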

OC serialization is 1.5-2x faster and also works with C++ (if you are mixing languages). It works with all built-in types, but not classes (partly because it is cross-language and it's hard to build C++ classes from Python).

rts1