
I have a pandas DataFrame with 5M rows and 20+ columns. I want to do some calculations in a for loop, as in the sample below:

grp_list=df.GroupName.unique()
df2 = pd.DataFrame()

for g in grp_list:
    tmp_df = df.loc[(df['GroupName']==g)]
    
    for i in range(len(tmp_df.GroupName)):
        # calls another function
        res=my_func(tmp_df)

    tmp_df['Result'] = res
    df2 = df2.append(tmp_df, ignore_index=True)  
  

There are ~900 distinct GroupName values. To improve performance, I want to parallelize the first for loop, since the computation is independent for each GroupName, and append the results to an output DataFrame. How can I do this effectively with multiprocessing, grouping by GroupName, with the final output as a single concatenated DataFrame?

1 Answer


First, you can try:

out = []
for _, g in df.groupby("GroupName"):
    res = my_func(g)
    out.append(res)

final_df = pd.concat(out)

This alone should speed up your computation significantly: df.groupby splits the frame once, instead of re-scanning all 5M rows with a boolean mask for each of the ~900 groups.
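Since the question's my_func isn't shown, here is a self-contained sketch of this groupby/concat pattern, with a hypothetical my_func that writes a per-group sum of a made-up Value column into a Result column (both names are illustrative, not from the question):

```python
import pandas as pd

# Toy frame standing in for the 5M-row original.
df = pd.DataFrame({
    "GroupName": ["a", "a", "b", "b", "c"],
    "Value": [1, 2, 3, 4, 5],
})

def my_func(g):
    # Hypothetical stand-in: the real computation is not shown in the question.
    g = g.copy()
    g["Result"] = g["Value"].sum()
    return g

out = []
for _, g in df.groupby("GroupName"):
    out.append(my_func(g))

# One concat at the end, instead of appending inside the loop.
final_df = pd.concat(out, ignore_index=True)
```

Note the single pd.concat at the end: repeatedly appending to a DataFrame inside the loop (as in the question) copies the accumulated data on every iteration.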

If you want to use multiprocessing (whether it actually helps depends on how heavy the computation inside my_func is, because each group must be pickled to and from the worker processes), you can use the following example:

import multiprocessing

def my_func(df):
    # modify df here
    # ...
    return df

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        groups = (g for _, g in df.groupby("GroupName"))
        out = []
        for res in pool.imap_unordered(my_func, groups):
            out.append(res)
    final_df = pd.concat(out)
Andrej Kesely