
I am working on attribution modelling with millions of records, so I want to parallelize this function over a pandas DataFrame:

```python
from multiprocessing import Pool

def paths_gen(df):
    for p in df.index:
        for q in df.columns[:-1]:
            if df.at[p, q] != 'empty':
                df.at[p, 'Path'] = str(df.at[p, 'Path']) + str(df.at[p, q]) + ' > '
    return df

pool = Pool(4)

results = pool.map(paths_gen, data)
```

But it's stuck forever. Can anybody help me?

  • Iterating through pandas objects is generally slow! Look [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#iteration) and [here](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758) for possible alternatives. – baccandr Sep 09 '19 at 06:27
  • Probably it checks every row in a separate process, and sending and receiving so many rows can take a long time. Better to split the data into a few small parts and send each part to a separate process, which then works on all rows in its part. But it is better still to write it without `for`-loops; then it will use internal code written in C/C++, which is much faster than a Python loop. – furas Sep 10 '19 at 01:15
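The loop-free approach the comments point to can be sketched as follows: loop only over the (few) columns while staying vectorized across the (millions of) rows. The data and column names are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    'touch1': ['a', 'empty', 'b'],
    'touch2': ['empty', 'c', 'd'],
    'Path':   ['', '', ''],
})

path = pd.Series('', index=df.index)
for c in df.columns[:-1]:
    # Vectorized over all rows at once: append "<value> > " where the cell is not 'empty'.
    path = path + np.where(df[c] != 'empty', df[c] + ' > ', '')
df['Path'] = path

print(df['Path'].tolist())   # ['a > ', 'c > ', 'b > d > ']
```

This keeps the per-row work in NumPy/pandas internals, which is usually far faster than `df.at` lookups in a nested Python loop, and it sidesteps multiprocessing entirely.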

0 Answers