
I'm using the Parallel function from joblib to parallelize a task. All of the processes take a pandas dataframe as input. To reduce run-time memory usage, is it possible to share this dataframe? All of the processes only read from it. I found a similar solution, but for a numpy array and using multiprocessing, here: Shared-memory objects in multiprocessing

This is a snippet of the code:

from joblib import Parallel, delayed

def func(df, cat):
    # y holds the name of the label column
    a = df[df[y] != cat]
    b = df[df[y] == cat]
    ...

output = Parallel(n_jobs=-1)(delayed(func)(df, cat) for cat in labels)

df is a pandas dataframe and labels is just a list.


1 Answer


I solved it by passing the filtered dataframes directly, so each worker only receives the subset of rows it actually needs instead of a full copy of df:

output = Parallel(n_jobs=-1)(delayed(func)(df[df[target] == cat],
                                           df[df[target] != cat],
                                           cat) for cat in labels)
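
If you would rather keep a single shared copy of the data, the memmapping trick from the linked numpy answer also carries over to joblib: with the default backend, Parallel automatically dumps large numpy arguments to disk and memory-maps them read-only in every worker (controlled by its max_nbytes parameter). Passing the dataframe's underlying array instead of the dataframe itself therefore avoids one full copy per process. A minimal sketch, assuming a purely numeric dataframe and a hypothetical label column y:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

y = 'label'  # hypothetical name of the label column

def func(values, columns, cat):
    # Rebuild a dataframe around the worker's read-only memmap;
    # copy=False asks pandas not to duplicate the buffer.
    sub = pd.DataFrame(values, columns=columns, copy=False)
    a = sub[sub[y] != cat]
    b = sub[sub[y] == cat]
    return len(a), len(b)  # placeholder for the real work

# Hypothetical data: two numeric features plus a label column.
df = pd.DataFrame(np.random.rand(100000, 3), columns=['f1', 'f2', y])
df[y] = np.random.randint(0, 3, len(df))
labels = df[y].unique()

# Arguments larger than max_nbytes (default '1M') are dumped once and
# memory-mapped read-only into every worker instead of being pickled
# for each task, so the big array is shared rather than copied.
output = Parallel(n_jobs=-1, max_nbytes='1M')(
    delayed(func)(df.values, df.columns, cat) for cat in labels)

Whether rebuilding the frame with copy=False really avoids copying the underlying block depends on the pandas version, so it is worth profiling memory before relying on this.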