
I am working on attribution modelling with millions of records, so I want to parallelize this function over a pandas DataFrame:

```python
from multiprocessing import Pool

def paths_gen(df):
    for p in df.index:
        for q in df.columns[:-1]:
            if df.at[p, q] != 'empty':
                df.at[p, 'Path'] = str(df.at[p, 'Path']) + str(df.at[p, q]) + ' > '
    return df

pool = Pool(4)

results = pool.map(paths_gen, data)
```

But it's stuck forever. Can anybody help me?

  • Iterating through pandas objects is generally slow! Look [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#iteration) and [here](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758) for possible alternatives. – baccandr Sep 09 '19 at 06:27
  • Probably it checks every row in a separate process, and sending and receiving so many rows can take a long time. Better to split the data into a few small parts and send each part to a separate process, which then works on all rows in its part. But it is better still to write it without `for`-loops; then it will use internal code written in C/C++, which is much faster than a Python loop. – furas Sep 10 '19 at 01:15
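The loop-free approach the comments point to can be sketched as follows: loop only over the (few) columns while staying vectorized across the (millions of) rows. The data and column names are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    'touch1': ['a', 'empty', 'b'],
    'touch2': ['empty', 'c', 'd'],
    'Path':   ['', '', ''],
})

path = pd.Series('', index=df.index)
for c in df.columns[:-1]:
    # Vectorized over all rows at once: append "<value> > " where the cell is not 'empty'.
    path = path + np.where(df[c] != 'empty', df[c] + ' > ', '')
df['Path'] = path

print(df['Path'].tolist())   # ['a > ', 'c > ', 'b > d > ']
```

This keeps the per-row work in NumPy/pandas internals, which is usually far faster than `df.at` lookups in a nested Python loop, and it sidesteps multiprocessing entirely.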

0 Answers