
How can I tell python multiprocessing which pickle protocol should be used for serialization?

The problem is that I want to process a large pandas DataFrame in a parallel process using multiprocessing.Pool.apply_async and I get the following error as a result:

OverflowError: cannot serialize a bytes object larger than 4 GiB

According to this link https://github.com/stan-dev/pystan/issues/197, the problem might be caused by multiprocessing using the default pickle protocol (pickle.DEFAULT_PROTOCOL = 3 on Python 3.4–3.7), whereas protocol 4 (pickle.HIGHEST_PROTOCOL on those versions) can serialize objects larger than 4 GiB and might therefore avoid the problem.
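For reference, choosing the protocol is straightforward when calling pickle directly (a minimal sketch; the 1 MiB payload here stands in for the real >4 GiB object):

```python
import pickle

# Protocol 4 (added in Python 3.4) is the first pickle protocol that
# can serialize bytes objects larger than 4 GiB; earlier protocols
# raise OverflowError at that size.
payload = b"x" * (1 << 20)  # 1 MiB stand-in for the oversized object
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
assert pickle.loads(blob) == payload
```

The difficulty in the question is that multiprocessing picks the protocol internally, so there is no direct parameter to pass.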

Thank you for your help!

UPDATE:

The following solution candidate (based on the link referenced by @juanpa.arrivillaga)

import multiprocessing
import multiprocessing.connection
import pickle

from multiprocessing.reduction import ForkingPickler

class ForkingPicklerHighest(ForkingPickler):
    @classmethod
    def dumps(cls, obj, protocol=None):
        # Ignore the requested protocol and always pickle with the
        # highest protocol available (4 on Python 3.4-3.7).
        return ForkingPickler.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# Monkey-patch the pickler that multiprocessing uses for its connections.
multiprocessing.connection._ForkingPickler = ForkingPicklerHighest

produces the following error:

struct.error: 'i' format requires -2147483648 <= number <= 2147483647

(presumably because multiprocessing's connection layer packs the message length into a signed 32-bit int, so payloads over 2 GiB fail regardless of the pickle protocol), and so a better (i.e. working) solution is needed!

S.V
  • I face the same issue. Any progress on a solution? – katosh Aug 16 '19 at 18:35
  • There is a PR for `multiprocessing` that may solve the issue: https://github.com/python/cpython/pull/10305 – katosh Aug 16 '19 at 19:14
  • 1
    I believe, my final solution was to use [joblib](https://joblib.readthedocs.io) instead of multiprocessing, and instead of transferring large objects, I used [joblib.dump](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and [joblib.load](https://joblib.readthedocs.io/en/latest/generated/joblib.load.html) to save them into a file and then read in the worker processes. – S.V Aug 16 '19 at 19:14
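The joblib workaround in the last comment boils down to "save to disk, load inside the worker", which sidesteps the multiprocessing pipe entirely. A minimal sketch of that pattern using only the standard library (pickle at protocol 4 in place of joblib.dump/joblib.load, and a plain list standing in for the DataFrame):

```python
import multiprocessing as mp
import os
import pickle
import tempfile

def row_count(path):
    # The worker loads the object from disk instead of receiving it
    # through the size-limited multiprocessing pipe.
    with open(path, "rb") as f:
        obj = pickle.load(f)
    return len(obj)

if __name__ == "__main__":
    big = list(range(1000))  # stands in for the large DataFrame
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    with open(path, "wb") as f:
        # Protocol 4 can serialize objects larger than 4 GiB on disk.
        pickle.dump(big, f, protocol=4)
    with mp.Pool(2) as pool:
        result = pool.apply_async(row_count, (path,))
        print(result.get())  # 1000
    os.remove(path)
```

Only the short file path crosses the process boundary, so neither the 4 GiB pickle limit nor the 2 GiB connection limit is hit.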
