
Problem Statement

I'm using Python 3 and trying to pickle a dictionary of IntervalTrees that weighs in at something like 2 to 3 GB. This is my console output:

10:39:25 - project: INFO - Checking if motifs file was generated by pickle...
10:39:25 - project: INFO -   - Motifs file does not seem to have been generated by pickle, proceeding to parse...
10:39:38 - project: INFO -   - Parse complete, constructing IntervalTrees...
11:04:05 - project: INFO -   - IntervalTree construction complete, saving pickle file for next time.
Traceback (most recent call last):
  File "/Users/alex/Documents/project/src/project.py", line 522, in dict_of_IntervalTree_from_motifs_file
    save_as_pickled_object(motifs, output_dir + 'motifs_IntervalTree_dictionary.pickle')
  File "/Users/alex/Documents/project/src/project.py", line 269, in save_as_pickled_object
    def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
OSError: [Errno 22] Invalid argument

The line in which I attempt the save is

def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))

The error is raised roughly 15 minutes after save_as_pickled_object is invoked (at about 11:20).

I tried this with a much smaller subsection of the motifs file and it worked fine, with all of the exact same code, so it must be an issue of scale. Are there any known bugs with pickle in Python 3.6 relating to the scale of what you try to pickle? Are there known bugs with pickling large files in general? Are there any known ways around this?

Thanks!

Update: This question might be a duplicate of Python 3 - Can pickle handle byte objects larger than 4GB?

Solution

This is the code I used instead.

import os
import pickle


def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.dump, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    bytes_out = pickle.dumps(obj)
    n_bytes = len(bytes_out)  # actual payload size; sys.getsizeof would overcount by the object header
    with open(filepath, 'wb') as f_out:
        # Write in slices smaller than 2**31 bytes so no single write() call exceeds the limit.
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])


def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            # Read back in chunks of the same size, mirroring the chunked writes.
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return None
    return obj
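
A minimal usage sketch, assuming both helpers live in the same module (the sample dictionary and filename below are placeholders, not the actual motifs data):

if __name__ == '__main__':
    # Placeholder stand-in for the real dict of IntervalTrees.
    sample = {i: list(range(100)) for i in range(1000)}
    filepath = 'sample.pickle'

    save_as_pickled_object(sample, filepath)
    restored = try_to_load_as_pickled_object_or_None(filepath)
    if restored is None:
        print('Load failed; fall back to re-parsing the source file.')
    else:
        assert restored == sample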
Alex Lenail
  • Hmmm... is the filepath valid? Please print it before saving. – Udi Mar 07 '17 at 16:57
  • Yeah, I've tried that. It's definitely a valid filepath. Like I said, "I tried this with a much smaller subsection of the motifs file and it worked fine, with all of the exact same code, so it must be an issue of scale." I did a run where I changed *just* the input file size. – Alex Lenail Mar 07 '17 at 17:26
  • Do you happen to use a FAT filesystem?? – Udi Mar 07 '17 at 18:03
  • @Udi I use whatever ships with macOS these days. (so I don't think so?) – Alex Lenail Mar 07 '17 at 18:05
  • @AlexLenail: Default FS on macOS is Journaled HFS+ -- verify that using `diskutil info -all | grep "File System Personality"` – Matthew Cole Mar 10 '17 at 18:41
  • How big is your swap partition? Can you increase its size to a multiple of your IntervalTree's size and try again? – Matthew Cole Mar 10 '17 at 18:44
  • @MatthewCole `File System Personality: Journaled HFS+` – Alex Lenail Mar 10 '17 at 19:46

1 Answer


Alex, if I am not mistaken, this bug report describes your issue exactly.

http://bugs.python.org/issue24658

As a workaround, I think you can use pickle.dumps instead of pickle.dump and then write the resulting bytes to your file in chunks smaller than 2**31.
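
A minimal sketch of that idea (the helper name dump_in_chunks and the default chunk size are illustrative, not part of the pickle API):

import pickle

def dump_in_chunks(obj, filepath, chunk_size=2**31 - 1):
    # Serialize in memory first, then write slices smaller than 2**31 bytes
    # so no single write() call hits the macOS limit from the bug report.
    data = pickle.dumps(obj)
    with open(filepath, 'wb') as f:
        for i in range(0, len(data), chunk_size):
            f.write(data[i:i + chunk_size])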

Giannis Spiliopoulos
  • Hi! Thanks for bringing this to my attention. If this workaround works for me I'll award you the bounty. =) It takes a little while to test so give me an hour... – Alex Lenail Mar 10 '17 at 19:45