How can I tell Python's multiprocessing module which pickle protocol to use for serialization?
I want to process a large pandas DataFrame in a worker process using multiprocessing.Pool.apply_async,
and I get the following error:
OverflowError: cannot serialize a bytes object larger than 4 GiB
According to https://github.com/stan-dev/pystan/issues/197, the problem may be that multiprocessing uses the default pickle protocol (pickle.DEFAULT_PROTOCOL is 3 here), whereas passing protocol=pickle.HIGHEST_PROTOCOL (which is 4 here) would avoid the problem.
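For reference, here is a minimal sketch of how the two protocol constants can be inspected and how a protocol is passed explicitly to pickle.dumps. The payload is a small hypothetical stand-in, not the real DataFrame, and the printed values depend on the Python version in use.

```python
import pickle

# Small stand-in payload (hypothetical); the real object is a large DataFrame.
payload = {"values": list(range(10))}

# These constants differ between Python versions.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Request protocol 4 explicitly (available since Python 3.4).
data = pickle.dumps(payload, protocol=4)
restored = pickle.loads(data)
```

A direct call like this accepts the protocol as an argument; the open question is how to make multiprocessing do the same internally.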
Thank you for your help!
UPDATE:
The following candidate solution (based on the link referenced by @juanpa.arrivillaga)
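import pickle
import multiprocessing.connection
from multiprocessing.reduction import ForkingPickler

class ForkingPicklerHighest(ForkingPickler):
    @classmethod
    def dumps(cls, obj, protocol=None):
        # Ignore the requested protocol and always use the highest available one
        return ForkingPickler.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# Replace the pickler used by multiprocessing connections
multiprocessing.connection._ForkingPickler = ForkingPicklerHighest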
produces the following error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
and so a better (i.e. working) solution is needed!
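The second error looks like it comes from a signed 32-bit length field rather than from pickle itself. My assumption (not confirmed for every Python version) is that the connection layer packs the message length with the struct format "!i", which would cap a single message at 2**31 - 1 bytes no matter which pickle protocol is used. The sketch below reproduces that limit in isolation and shows one possible workaround: splitting the rows into smaller ranges so that each piece sent to a worker stays well under the cap. The helper names fits_in_int32_header and chunk_bounds are hypothetical.

```python
import struct


def fits_in_int32_header(n_bytes):
    """Return True if n_bytes can be packed as a signed 32-bit length."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


def chunk_bounds(n_rows, chunk_rows):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, chunk_rows):
        yield start, min(start + chunk_rows, n_rows)


# A length just below the limit packs fine; one byte more does not.
print(fits_in_int32_header(2**31 - 1))
print(fits_in_int32_header(2**31))

# Row ranges for a hypothetical 10-row frame split into chunks of 4 rows.
print(list(chunk_bounds(10, 4)))
```

With a DataFrame, each (start, stop) pair could be used as df.iloc[start:stop] and submitted as its own apply_async task, with the partial results combined afterwards. Whether that fits depends on whether the processing can be done per chunk.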