
Is there any option to load a pickle file in chunks?

I know we can save the data as CSV and load it in chunks. But other than CSV, is there any option to load a pickle file, or any Python-native format, in chunks?

Jamie
Naren Babu R

  • Are you the one pickling, or are you just given a dump? If you are doing the pickling, give a short example of your data and how you pickle it. – kabanus Jan 30 '20 at 09:52
  • Hello! You can't read a pickle file by chunk, but you can use the [hdf](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html?highlight=read_hdf#pandas.read_hdf) format for this. – Grigory Skvortsov Jan 30 '20 at 09:58

3 Answers


Based on the documentation for Python's pickle, there is currently no support for loading in chunks.

However, it is possible to split the data into chunks yourself when writing, and then read those chunks back. For example, suppose the original structure is

import pickle

filename = "myfile.pkl"
str_to_save = "myname"

with open(filename,'wb') as file_handle:
    pickle.dump(str_to_save, file_handle)
    
with open(filename,'rb') as file_handle:
    result = pickle.load(file_handle)

print(result)

That could be split into two separate pickle files:

import pickle

filename_1 = "myfile_1.pkl"
filename_2 = "myfile_2.pkl"
str_to_save = "myname"

with open(filename_1,'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
with open(filename_2,'wb') as file_handle:
    pickle.dump(str_to_save[4:], file_handle)
    
with open(filename_1,'rb') as file_handle:
    result_1 = pickle.load(file_handle)
with open(filename_2,'rb') as file_handle:
    result_2 = pickle.load(file_handle)

print(result_1 + result_2)

As per AKX's comment, writing multiple objects to a single file also works:

import pickle

filename = "myfile.pkl"
str_to_save = "myname"

with open(filename,'wb') as file_handle:
    pickle.dump(str_to_save[0:4], file_handle)
    pickle.dump(str_to_save[4:], file_handle)
    
with open(filename,'rb') as file_handle:
    result = pickle.load(file_handle)
    print(result)
    result = pickle.load(file_handle)
    print(result)
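Since you usually won't know in advance how many objects were dumped into the file, a loop that catches EOFError can read them all back. This is a sketch; the filename and chunk values are placeholders:

```python
import pickle

filename = "myfile_chunked.pkl"
data = ["chunk-one", "chunk-two", "chunk-three"]

# Dump each chunk as its own pickle into the same file.
with open(filename, "wb") as file_handle:
    for chunk in data:
        pickle.dump(chunk, file_handle)

# Load pickles one at a time until the file is exhausted.
result = []
with open(filename, "rb") as file_handle:
    while True:
        try:
            result.append(pickle.load(file_handle))
        except EOFError:
            break

print(result)  # ['chunk-one', 'chunk-two', 'chunk-three']
```

Each `pickle.dump()` writes a complete, self-contained pickle stream, so `pickle.load()` naturally stops at the boundary between objects and EOFError signals the end of the file.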
Ben
    Pickle objects can be concatenated into a single file (and read as such). That is, you can just `pickle.dump()` multiple times into the same file. – AKX Jun 07 '21 at 18:35
  • Cool, I didn't think of that. I've added a new snippet and cited your comment. – Ben Jun 07 '21 at 18:47

I had a similar issue, where I wrote a barrel file-descriptor pool, and noticed that my pickle files were getting corrupted when I closed a file descriptor. Although you may perform multiple dump() operations on an open file descriptor, it's not possible to subsequently do an open('file', 'ab') to start saving a new set of objects.

I got around this by doing a pickler.dump(None) as a session terminator right before I had to close the file descriptor, and upon re-opening, I instantiated a new Pickler instance to resume writing to the file.

When loading from this file, a None object signified an end-of-session, at which point I instantiated a new Unpickler instance with the file descriptor to continue reading the remainder of the multi-session pickle file.

This only applies if for some reason you have to close the file descriptor, though. Otherwise, any number of dump() calls can be performed for load() later.
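A minimal sketch of that session-terminator idea (the marker convention, filenames, and values here are illustrative, not taken from the actual pool implementation):

```python
import pickle

path = "sessions.pkl"

# Session 1: dump objects, then a None terminator before closing.
with open(path, "wb") as fh:
    pickle.dump("first", fh)
    pickle.dump("second", fh)
    pickle.dump(None, fh)  # end-of-session marker

# Session 2: reopen in append mode and keep dumping.
with open(path, "ab") as fh:
    pickle.dump("third", fh)
    pickle.dump(None, fh)

# Reading back: None closes a session, EOFError ends the file.
sessions, current = [], []
with open(path, "rb") as fh:
    while True:
        try:
            obj = pickle.load(fh)
        except EOFError:
            break
        if obj is None:
            sessions.append(current)
            current = []
        else:
            current.append(obj)

print(sessions)  # [['first', 'second'], ['third']]
```

Note this sketch uses the module-level pickle.dump(), which writes a complete, self-contained pickle each call; the per-session None terminator matters mainly when a long-lived Pickler instance (with its shared memo) has to be replaced after reopening the file, as described above.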

Ben Y

As far as I understand pickle, loading/dumping by chunk is not possible. Pickle intrinsically reads a complete data stream in "chunks" of variable length, depending on flags within the data stream; that is what serialization is all about. The data stream itself could have been cut into chunks earlier (say, for network transfer), but chunks cannot be pickled/unpickled "on the fly".


But maybe something intermediate can be achieved with pickle's "buffers" and "out-of-band" features for very large data.

Note this is not exactly loading/saving a single pickle file in chunks. It only applies to objects encountered during the serialization process that declare themselves as being "out of band" (serialized separately).

Quoting the Pickler class doc:

If buffer_callback is not None, then it can be called any number of times with a buffer view. If the callback returns a false value (such as None), the given buffer is out-of-band; otherwise the buffer is serialized in-band, i.e. inside the pickle stream. (emphasis mine)

Quoting the "Out of band" concept doc:

In some contexts, the pickle module is used to transfer massive amounts of data. Therefore, it can be important to minimize the number of memory copies, to preserve performance and resource consumption. However, normal operation of the pickle module, as it transforms a graph-like structure of objects into a sequential stream of bytes, intrinsically involves copying data to and from the pickle stream.

This constraint can be eschewed if both the provider (the implementation of the object types to be transferred) and the consumer (the implementation of the communications system) support the out-of-band transfer facilities provided by pickle protocol 5 and higher.

Example taken from the docs:

b = ZeroCopyByteArray(b"abc") # NB: the class defines special __reduce_ex__ and _reconstruct methods
buffers = []
data = pickle.dumps(b, protocol=5, buffer_callback=buffers.append)
# we could do things with these buffers like:
#  - writing each to a single file,
#  - sending them over network,
# ...
new_b = pickle.loads(data, buffers=buffers) # load in chunks

From this example, we could consider writing each buffer into a file, or sending each over a network. Then unpickling would be performed by loading those files (or network payloads) and passing them to the unpickler.
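For completeness, here is a self-contained, runnable version of that round trip; the ZeroCopyByteArray class below is reproduced in spirit from the pickle documentation and needs Python 3.8+ (pickle protocol 5):

```python
import pickle


class ZeroCopyByteArray(bytearray):
    """bytearray subclass whose payload can travel out-of-band."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Offer the raw buffer to buffer_callback instead of
            # copying it into the pickle stream.
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        # PickleBuffer is forbidden with pickle protocols <= 4.
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            obj = m.obj  # handle on the underlying buffer object
            if type(obj) is cls:
                return obj  # zero-copy path: reuse the original object
            return cls(obj)


b = ZeroCopyByteArray(b"abc")
buffers = []
data = pickle.dumps(b, protocol=5, buffer_callback=buffers.append)

# `data` is the (small) pickle stream; `buffers` holds the payload views.
# Both must be supplied together at load time:
new_b = pickle.loads(data, buffers=buffers)
print(bytes(new_b))  # b'abc'
```

Here `data` alone is not enough to reconstruct the object; dropping the `buffers` argument at load time raises an error, which illustrates the two-artifact split discussed below.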

But note that we end up with two serialized pieces of data in the example:

  • data
  • buffers

Not really what the OP desires, and not exactly pickle load/dump by chunks.

From a pickle-to-a-single-file perspective, I don't think this gives any benefit, because we would have to define a custom method to pack both data and buffers into one file, i.e. define a new data format, which feels like ruining pickle's initial benefits.


Quoting Unpickler constructor doc:

If buffers is not None, it should be an iterable of buffer-enabled objects that is consumed each time the pickle stream references an out-of-band buffer view. Such buffers have been given in order to the buffer_callback of a Pickler object. Changed in version 3.8: The buffers argument was added.

LoneWanderer