67

Based on this comment and the referenced documentation, pickle protocol 4 (available since Python 3.4) should be able to pickle byte objects larger than 4 GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large bytearray:

>>> import pickle
>>> x = bytearray(8 * 1000 * 1000 * 1000)
>>> fp = open("x.dat", "wb")
>>> pickle.dump(x, fp, protocol = 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

Is there a bug in my code or am I misunderstanding the documentation?

smci
RandomBits
  • There's no problem for me. Python 3.4.1 on Windows. – Jedi Jul 17 '15 at 03:59
  • Breaks on OS X. This doesn't actually have anything to do with pickle. `open('/dev/null', 'wb').write(bytearray(2**31 - 1))` works, but `open('/dev/null', 'wb').write(bytearray(2**3))` throws that error. Python 2 doesn't have this issue. – Blender Jul 17 '15 at 04:13
  • @Blender: What throws an error for you works for me with both Python 2.7.10 and Python 3.4.3 (on OS X, MacPorts versions). – Eric O. Lebigot Jul 17 '15 at 04:15
  • @EOL: I'm using Homebrew's Python. – Blender Jul 17 '15 at 04:17
  • @RandomBits: Did you install Python 3 from Homebrew? – Blender Jul 17 '15 at 04:21
  • @Blender, @EOL `open('/dev/null','wb').write(bytearray(2**31))` fails for me as well with MacPorts' Python 3.4.3. – RandomBits Jul 17 '15 at 04:25
  • I see: there is a typo in Blender's comment (`(3)` instead of `(31)`, which makes more sense given the context). With this change, I observe the same behavior as @Blender. – Eric O. Lebigot Jul 17 '15 at 04:34
  • @EOL: Both work fine on 2.7.9 from homebrew and the stock OS X binary. – Blender Jul 17 '15 at 04:49
  • @Blender: Same for MacPorts' Python 2.7.10 on OS X 10.10. – Eric O. Lebigot Jul 18 '15 at 02:23
  • Blender's test above (with `2**31` instead of `2**3`) shows that there is a bug in Python 3.4.3 (Homebrew and MacPorts) on OS X: `open()` should be able to write a 4 GB file. I'll check whether this has been reported, and I will file a bug report if not. – Eric O. Lebigot Jul 18 '15 at 02:29
  • Bug reported: http://bugs.python.org/issue24658. – Eric O. Lebigot Jul 18 '15 at 03:00
  • I think this may have nothing to do with Python 3.4.x itself but how you compiled your interpreter -- I have no issues on Mac OS X btw –  Jul 22 '15 at 05:47
  • I'm voting to close this question as off-topic because this is a bug in Python. We cannot solve it, only work around it. – Kevin Jul 24 '15 at 16:16
  • @Kevin So what's a work around for pickling and un-pickling large files? The bug doesn't appear to be getting resolved. – Ian Dec 16 '15 at 16:18
  • @Ian: If you *know* you just have a `bytes` object, you can and should just write it out as-is (i.e. `with open(something) as f: f.write(your_data_here)`; perhaps prepend a length field using `struct.pack()`). It's only when you need to preserve type information or send something more complex than pure `bytes` objects that pickling becomes necessary. Even then, you can often get away with JSON or another, simpler format. – Kevin Dec 19 '15 at 21:19
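A minimal sketch of the raw-bytes approach Kevin describes in the last comment. The file name and the 8-byte length prefix are illustrative assumptions, and on an affected build any single write or read over 2 GiB would still need to be chunked, as in the answers below:

import struct

payload = bytes(10 * 1000 * 1000)  # stand-in for your large bytes object

# write: an 8-byte little-endian length prefix, then the raw bytes
with open("payload.bin", "wb") as f:
    f.write(struct.pack("<Q", len(payload)))
    f.write(payload)

# read it back
with open("payload.bin", "rb") as f:
    (length,) = struct.unpack("<Q", f.read(8))
    restored = f.read(length)

assert restored == payload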

7 Answers

39

Here is a simple workaround for issue 24658: use pickle.dumps and pickle.loads as usual, but write and read the resulting bytes in chunks of at most 2**31 - 1 bytes.

import pickle
import os.path

file_path = "pkl.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1
data = bytearray(n_bytes)

## write
bytes_out = pickle.dumps(data)
with open(file_path, 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

## read
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
data2 = pickle.loads(bytes_in)

assert data == data2
lunguini
  • Thank you. This helped greatly. One thing: for the write loop, should `for idx in range(0, n_bytes, max_bytes):` be `for idx in range(0, len(bytes_out), max_bytes):`? – naoko Jun 22 '17 at 15:50
  • @lunguini, for the write chunk, instead of `range(0, n_bytes, max_bytes)`, should it be `range(0, len(bytes_out), max_bytes)`? Reason I'm suggesting this is (on my machine, anyway), `n_bytes = 1024`, but `len(bytes_out) = 1062`, and for others coming to this solution, you're only using the length of your example file, which isn't necessarily useful for real-world scenarios. – seaders Jun 02 '18 at 11:52
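The chunking above can also be packaged as reusable helpers (a sketch; the function names are mine, not from the answer):

import os
import pickle

MAX_BYTES = 2**31 - 1  # largest single read/write that avoids issue 24658

def chunked_pickle_dump(obj, file_path):
    serialized = pickle.dumps(obj, protocol=4)
    with open(file_path, "wb") as f_out:
        for idx in range(0, len(serialized), MAX_BYTES):
            f_out.write(serialized[idx:idx + MAX_BYTES])

def chunked_pickle_load(file_path):
    size = os.path.getsize(file_path)
    buffer = bytearray(0)
    with open(file_path, "rb") as f_in:
        for _ in range(0, size, MAX_BYTES):
            buffer += f_in.read(MAX_BYTES)
    return pickle.loads(buffer)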
25

To sum up what was answered in the comments:

Yes, Python can pickle byte objects bigger than 4 GB. The observed error is caused by a bug in the implementation (see Issue 24658).

Martin Thoma
  • How is this issue still not fixed? Insane – dpb Jun 30 '17 at 09:59
  • It's 2018 and the bug is still there. Does anyone know why? – Calvin Ku Jan 17 '18 at 08:54
  • It’s been fixed for [3.6.8](https://github.com/python/cpython/commit/a5ebc205beea2bf1501e4ac33ed6e81732dd0604), [3.7.2](https://github.com/python/cpython/commit/178d1c07778553bf66e09fe0bb13796be3fb9abf) and [3.8](https://github.com/python/cpython/commit/74a8b6ea7e0a8508b13a1c75ec9b91febd8b5557) in October 2018; the issue remains open because the author wanted to backport to 2.7. In 6 weeks time that’ll be moot as Python 2.x reaches EOL. – Martijn Pieters Nov 19 '19 at 02:11
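A quick way to check whether a given interpreter build is affected, adapted from Blender's test in the comments (note that it temporarily allocates a 2 GiB buffer):

import os

try:
    with open(os.devnull, "wb") as f:
        f.write(bytearray(2**31))
    print("large single write succeeded; this build looks unaffected")
except OSError as exc:
    print("affected by issue 24658:", exc)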
15

Here is the full workaround, though it seems pickle.load no longer attempts a single huge read (I am on Python 3.5.2), so strictly speaking only pickle.dump needs this wrapper to work properly.

import pickle

class MacOSFile(object):

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        # print("reading total_bytes=%s" % n, flush=True)
        if n >= (1 << 31):
            buffer = bytearray(n)
            idx = 0
            while idx < n:
                batch_size = min(n - idx, (1 << 31) - 1)  # stay just below the 2 GiB limit
                # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                buffer[idx:idx + batch_size] = self.f.read(batch_size)
                # print("done.", flush=True)
                idx += batch_size
            return buffer
        return self.f.read(n)

    def write(self, buffer):
        n = len(buffer)
        print("writing total_bytes=%s..." % n, flush=True)
        idx = 0
        while idx < n:
            batch_size = min(n - idx, (1 << 31) - 1)  # stay just below the 2 GiB limit
            print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
            self.f.write(buffer[idx:idx + batch_size])
            print("done.", flush=True)
            idx += batch_size


def pickle_dump(obj, file_path):
    with open(file_path, "wb") as f:
        return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(MacOSFile(f))
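
Usage, with an arbitrary example file name:

obj = bytearray(2**31)  # anything of 2 GiB or more triggers the bug on affected builds
pickle_dump(obj, "big.pkl")
restored = pickle_load("big.pkl")
assert obj == restored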
jpgard
Sam Cohan
9

You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4) it should work.

Yohan Obadia
4

Reading a file in 2 GB chunks takes twice as much memory as needed if bytes concatenation is performed, so my approach to loading pickles is based on a pre-allocated bytearray:

class MacOSFile(object):
    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                size = min(n - pos, (1 << 31) - 1)  # stay just below the 2 GiB limit
                chunk = self.f.read(size)
                buffer[pos:pos + size] = chunk
                pos += size
            return buffer
        return self.f.read(n)

Usage:

with open("/path", "rb") as fin:
    obj = pickle.load(MacOSFile(fin))
markhor
  • Will the above code work for any platform? If so, the above code is more like "FileThatAlsoCanBeLoadedByPickleOnOSX" right? Just trying to understand... It's not like if I use `pickle.load(MacOSFile(fin))` on linux this will break, correct? @markhor – Alex Lenail Mar 11 '17 at 14:42
  • Also, would you implement a `write` method? – Alex Lenail Mar 11 '17 at 14:49
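In answer to the last comment, a write method in the same chunked style could look like this (a sketch mirroring the earlier MacOSFile answer, to be added to the class above):

    def write(self, buffer):
        # write in chunks below the 2 GiB limit, like read() above
        n = len(buffer)
        pos = 0
        while pos < n:
            size = min(n - pos, (1 << 31) - 1)
            self.f.write(buffer[pos:pos + size])
            pos += size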
1

Had the same issue and fixed it by upgrading to Python 3.6.8.

This seems to be the PR that did it: https://github.com/python/cpython/pull/9937

0

I also ran into this issue. To work around it, I split the work into several iterations. Say I have 50,000 documents for which I have to compute tf-idf and do kNN classification. When I run over all 50,000 at once, it gives me "that error", so I process them in chunks.

tokenized_documents = self.load_tokenized_preprocessing_documents()
idf = self.load_idf_41227()
doc_length = len(documents)
for iteration in range(0, 9):
    tfidf_documents = []
    for index in range(iteration, 4000):
        doc_tfidf = []
        for term in idf.keys():
            tf = self.term_frequency(term, tokenized_documents[index])
            doc_tfidf.append(tf * idf[term])
        doc = documents[index]
        tfidf = [doc_tfidf, doc[0], doc[1]]
        tfidf_documents.append(tfidf)
        print("{} from {} document {}".format(index, doc_length, doc[0]))

    self.save_tfidf_41227(tfidf_documents, iteration)
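
A generic sketch of the same idea, with hypothetical names: process the data in batches and pickle each batch to its own smaller file instead of producing one huge pickle.

import pickle

def save_in_batches(items, process, batch_size, path_template="batch_{}.pkl"):
    # process is your per-item computation (here, the tf-idf step above)
    for start in range(0, len(items), batch_size):
        results = [process(item) for item in items[start:start + batch_size]]
        with open(path_template.format(start // batch_size), "wb") as f:
            pickle.dump(results, f, protocol=4)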
raditya gumay