
I'm trying to revisit this slightly older question and see if there's a better answer these days.

I'm using Python 3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: the example doesn't actually use the dataframe, but my real code does).

import numpy as np
import pandas as pd
from multiprocessing import Pool

cores = 3

def func(i):
    return i * 2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']  # not used in this toy example, but my real code reads it

    return pd.Series([func(i) for i in values])

N = 10000
arr = list(range(N))
data_split = np.array_split(arr, cores)

df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)

gen = ({'values': i, 'df': df}
       for i in data_split)

data = pd.concat(pool.map(par_func_dict, gen), axis=0)

pool.close()
pool.join()

I'm wondering if there's a way to avoid feeding the generator a copy of the dataframe for every chunk, since those copies are what take up so much memory.

The answer to the question linked above suggests using `multiprocessing.Process()`, but from what I can tell, it's difficult to use that with functions that return values (you need to incorporate signals/events), and the comments there indicate that each process still ends up using a large amount of memory.
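One direction I've been considering is to hand the dataframe to each worker once through the `Pool` initializer, so it's pickled once per process rather than once per task. A minimal sketch (the names `init_worker` and `par_func` are just for illustration; on Unix with the fork start method, a module-level `df` defined before the pool is created would be inherited copy-on-write without any pickling at all):

import numpy as np
import pandas as pd
from multiprocessing import Pool

df = None  # set in each worker by the initializer

def init_worker(shared_df):
    # runs once per worker process, so the dataframe is pickled
    # once per process instead of once per task
    global df
    df = shared_df

def par_func(values):
    # reads the global df instead of receiving it in every work item
    return pd.Series([i * 2 for i in values])

if __name__ == '__main__':
    big_df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(10000)), 3)

    with Pool(3, initializer=init_worker, initargs=(big_df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)

I don't know whether that's actually the right pattern here, which is why I'm asking.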

David
  • The comments on the other thread don't say how they are measuring the memory usage... many platforms will include shared memory when reporting the memory usage of a process. – soundstripe Nov 17 '18 at 19:25
  • I guess one possible solution is to make `df` a global variable. Not ideal, but as long as none of the processes actually edit it, I think it should be okay. – David Nov 17 '18 at 19:33
  • I would also look into other ways to parallelize the work, such as [`dask`](http://docs.dask.org/en/latest/dataframe.html) (see the sketch after these comments). – soundstripe Nov 17 '18 at 19:46
  • @soundstripe: I've seen a few talks on `dask`. Haven't had a chance to experiment with it yet, thanks for the suggestion. – David Nov 17 '18 at 19:54
  • You can also try [joblib](https://joblib.readthedocs.io/en/latest/parallel.html); see [using shared objects with joblib](https://stackoverflow.com/questions/46657885/how-to-write-to-a-shared-variable-in-python-joblib) (sketch below). – user238607 Nov 18 '18 at 13:53
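For reference, a rough illustration of the `dask` suggestion above (a minimal sketch; the partition count and the `part * 2` transform are placeholders for whatever the real workload needs):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.random.randn(10000, 10))

# split the dataframe into partitions that dask can process in parallel
ddf = dd.from_pandas(df, npartitions=3)

# map_partitions applies an ordinary pandas function to each partition
result = ddf.map_partitions(lambda part: part * 2).compute()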
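And a rough sketch of the `joblib` route from the last comment (note: joblib only memory-maps large numpy arrays automatically; a dataframe argument is still serialized for each call, so passing `df.values` instead could avoid copies if the data is purely numeric):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def par_func(values, df):
    # df is available read-only inside the worker
    return pd.Series([i * 2 for i in values])

df = pd.DataFrame(np.random.randn(10, 10))
data_split = np.array_split(list(range(10000)), 3)

# n_jobs=3 runs three worker processes
results = Parallel(n_jobs=3)(delayed(par_func)(chunk, df) for chunk in data_split)
data = pd.concat(results, axis=0)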
