
I'm trying to revisit this slightly older question and see if there's a better answer these days.

I'm using Python 3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: the example doesn't actually use the dataframe, but my real code does).

import numpy as np
import pandas as pd
from multiprocessing import Pool

cores = 3

def func(i):
    return i * 2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']  # not used in this toy example, but my real code reads it

    return pd.Series([func(i) for i in values])

N = 10000
arr = list(range(N))
data_split = np.array_split(arr, cores)

df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)

gen = ({'values': i, 'df': df}
       for i in data_split)

data = pd.concat(pool.map(par_func_dict, gen), axis=0)

pool.close()
pool.join()

I'm wondering if there's a way to avoid feeding the generator a copy of the dataframe for every chunk, since those copies are what take up so much memory.

The answer to the question linked above suggests using `multiprocessing.Process()`, but from what I can tell, it's difficult to use that with functions that return values (you need to incorporate signals/events), and the comments there indicate that each process still ends up using a large amount of memory.
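One direction I've been considering is to hand the dataframe to each worker once through the `Pool` initializer, so it's pickled once per process rather than once per task. A minimal sketch (the names `init_worker` and `par_func` are just for illustration; on Unix with the fork start method, a module-level `df` defined before the pool is created would be inherited copy-on-write without any pickling at all):

import numpy as np
import pandas as pd
from multiprocessing import Pool

df = None  # set in each worker by the initializer

def init_worker(shared_df):
    # runs once per worker process, so the dataframe is pickled
    # once per process instead of once per task
    global df
    df = shared_df

def par_func(values):
    # reads the global df instead of receiving it in every work item
    return pd.Series([i * 2 for i in values])

if __name__ == '__main__':
    big_df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(10000)), 3)

    with Pool(3, initializer=init_worker, initargs=(big_df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)

I don't know whether that's actually the right pattern here, which is why I'm asking.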

David
  • The comments on the other thread don't say how they are measuring the memory usage... many platforms will include shared memory when reporting the memory usage of a process. – soundstripe Nov 17 '18 at 19:25
  • I guess one possible solution is to make `df` a global variable. Not ideal, but as long as none of the processes actually edit it, I think it should be okay. – David Nov 17 '18 at 19:33
  • I would also look into other ways to parallelize the work, such as [`dask`](http://docs.dask.org/en/latest/dataframe.html) (see the sketch after these comments). – soundstripe Nov 17 '18 at 19:46
  • @soundstripe: I've seen a few talks on `dask`. Haven't had a chance to experiment with it yet, thanks for the suggestion. – David Nov 17 '18 at 19:54
  • You can also try [joblib](https://joblib.readthedocs.io/en/latest/parallel.html); see [using shared objects with joblib](https://stackoverflow.com/questions/46657885/how-to-write-to-a-shared-variable-in-python-joblib) (sketch below). – user238607 Nov 18 '18 at 13:53
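For reference, a rough illustration of the `dask` suggestion above (a minimal sketch; the partition count and the `part * 2` transform are placeholders for whatever the real workload needs):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.random.randn(10000, 10))

# split the dataframe into partitions that dask can process in parallel
ddf = dd.from_pandas(df, npartitions=3)

# map_partitions applies an ordinary pandas function to each partition
result = ddf.map_partitions(lambda part: part * 2).compute()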
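And a rough sketch of the `joblib` route from the last comment (note: joblib only memory-maps large numpy arrays automatically; a dataframe argument is still serialized for each call, so passing `df.values` instead could avoid copies if the data is purely numeric):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def par_func(values, df):
    # df is available read-only inside the worker
    return pd.Series([i * 2 for i in values])

df = pd.DataFrame(np.random.randn(10, 10))
data_split = np.array_split(list(range(10000)), 3)

# n_jobs=3 runs three worker processes
results = Parallel(n_jobs=3)(delayed(par_func)(chunk, df) for chunk in data_split)
data = pd.concat(results, axis=0)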
