
We are running a script using the multiprocessing library (Python 3.6), where a big pd.DataFrame is passed as an argument to a function:

from multiprocessing import Pool
import time

def my_function(big_df):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    with Pool(10) as p:
        res = {}
        for id_, big_df in some_dict_of_big_dfs.items():
            res[id_] = p.apply_async(my_function, (big_df,))
        output = {id_: res[id_].get() for id_ in res}

The problem is that we are getting an error from the pickle library.

Reason: 'OverflowError('cannot serialize a bytes objects larger than 4GiB',)'

We are aware that pickle protocol 4 can serialize larger objects (related question, link), but we don't know how to modify the protocol that multiprocessing is using.
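
For context, a minimal sketch of the limit itself (the 5 GiB payload is illustrative only and needs enough free RAM; protocols below 4 store lengths in 32-bit fields, hence the OverflowError):

import pickle

big = bytes(5 * 1024**3)              # > 4 GiB payload, illustrative only
try:
    pickle.dumps(big, protocol=3)     # the default protocol on Python 3.6
except OverflowError as e:
    print(e)                          # cannot serialize a bytes object larger than 4 GiB
data = pickle.dumps(big, protocol=4)  # protocol 4 supports objects > 4 GiB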

Does anybody know what to do? Thanks!!

Pablo

2 Answers


Apparently there is an open issue about this topic, and there are a few related initiatives described in this particular answer. I found a way to change the default pickle protocol used by the multiprocessing library, based on this answer. As was pointed out in the comments, this solution only works with the Linux and OS X multiprocessing implementation (the fork start method).

Solution:

You first create a new separate module:

pickle4reducer.py

from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    def __init__(self, *args):
        args = list(args)  # args arrives as a tuple; make it mutable
        if len(args) > 1:
            args[1] = 4    # force pickle protocol 4
        else:
            args.append(4)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        return ForkingPickler.dumps(obj, protocol)


def dump(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)


class Pickle4Reducer(AbstractReducer):
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump

And then, in your main script, you need to add the following:

import pickle4reducer
import multiprocessing as mp
ctx = mp.get_context()
ctx.reducer = pickle4reducer.Pickle4Reducer()

with mp.Pool(4) as p:
    # do something

That will probably solve the overflow problem.
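
Putting it together with the pattern from the question (a sketch; some_dict_of_big_dfs is the question's placeholder):

import multiprocessing as mp
import pickle4reducer

ctx = mp.get_context()                         # default context ('fork' on Linux/OS X)
ctx.reducer = pickle4reducer.Pickle4Reducer()  # swap in the protocol-4 reducer

def my_function(big_df):
    ...  # do something time consuming

if __name__ == '__main__':
    with mp.Pool(10) as p:
        res = {id_: p.apply_async(my_function, (big_df,))
               for id_, big_df in some_dict_of_big_dfs.items()}
        output = {id_: r.get() for id_, r in res.items()}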

But, a warning: you might want to read this before doing anything, or you might hit the same error as me:

'i' format requires -2147483648 <= number <= 2147483647

(the reason for this error is well explained in the link above). Long story short, multiprocessing sends data between its processes using the pickle protocol. If you are already hitting the 4 GiB limit, that probably means you should consider redefining your functions more as "void" methods rather than input/output methods. All this inbound/outbound data increases RAM usage, is probably inefficient by construction (my case), and it might be better to point all processes at the same object rather than create a new copy for each call.
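
As an illustration of that last point, here is a sketch (my names, not part of the original code) that relies on the fork start method: workers inherit the parent's memory copy-on-write, so the big DataFrame is never pickled at all:

from multiprocessing import Pool
import pandas as pd

big_df = None  # set in the parent before the Pool is created

def my_function(task_id):
    # forked workers inherit big_df copy-on-write; only the small
    # task_id argument and the small return value are pickled
    return big_df['value'].sum()  # placeholder computation

if __name__ == '__main__':
    big_df = pd.DataFrame({'value': range(1000)})  # placeholder data
    with Pool(4) as p:
        output = p.map(my_function, range(10))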

Hope this helps.

Pablo
  • I got the following error: `src\pickle4reducer.py", line 6, in __init__ args[1] = 2 TypeError: 'tuple' object does not support item assignment` Any suggestions? – Vítor Lourenço Jul 31 '18 at 14:53
  • @BarbarianSpock what OS are you working on? (we tried the same code on a Mac and it worked fine) I think the implementation of the `multiprocessing` library is different on Windows. That might be the reason. – Pablo Jul 31 '18 at 21:11
  • I'm trying to understand the problem... I've tried the code in two environments. In one of them (Linux) everything works fine (that's why I posted that code); in the other (Windows) I got the same error as you (probably because the *args var changes). Apparently both environments should be similar (created with almost the same conda requirements list), but I still can't understand why it works in one env and not in the other. – Pablo Aug 01 '18 at 13:28
  • We were using a Windows environment; however, after your explanation we tested it in a Linux environment using Python 3.6 (the default OS Python was 3.5, but the `multiprocessing` seems different). We managed to get it to work. – Vítor Lourenço Aug 02 '18 at 11:24
  • Depending on the start method, this patch will or will not fail. With `get_context()` or `get_context('fork')` everything works great, but with `get_context('spawn')` or `get_context('forkserver')` it does not work. – GuillaumeA Aug 12 '19 at 19:36
  • @Pablo on the line `ns.df = big_df` I am getting an error: `mgr = ctx.Manager()` `ns = mgr.Namespace()` `ns.df = big_df` Error: `cls(buf, protocol).dump(obj) OverflowError: cannot serialize a bytes object larger than 4 GiB` – piyush-balwani Oct 05 '21 at 14:36
  • As @hongkail says in the next answer, I've solved both problems by updating from Python 3.7 to Python 3.8 – VictorCB Mar 15 '23 at 18:41

Supplementing the answer from Pablo:

The following problem is resolved by Python 3.8, if you are okay with using that version of Python:

'i' format requires -2147483648 <= number <= 2147483647
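
A quick way to check: on Python 3.8+ the default pickle protocol was bumped to 5 (which, like protocol 4, handles objects larger than 4 GiB), and multiprocessing's transfer size limits were lifted as well:

import pickle
import sys

print(sys.version_info[:2])     # (3, 8) or newer
print(pickle.DEFAULT_PROTOCOL)  # 5 on Python 3.8+
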
hongkail