I have a pandas dataframe with 5M rows and 20+ columns. I want to do some calculations in a for loop, as in the sample below:
grp_list = df.GroupName.unique()
df2 = pd.DataFrame()
for g in grp_list:
    tmp_df = df.loc[(df['GroupName'] == g)]
    for i in range(len(tmp_df.GroupName)):
        # calls another function
        res = my_func(tmp_df)
        tmp_df['Result'] = res
    df2 = df2.append(tmp_df, ignore_index=True)
There are ~900 distinct GroupName values. To improve performance, I want to parallelize the first for loop, since it is independent for each GroupName, and append the results to an output dataframe. How can I do this effectively with multiprocessing, grouping by GroupName, with the final output as a single appended dataframe?
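
For reference, here is a rough sketch of the shape I have in mind, not a tested solution. It assumes my_func takes one group's dataframe and returns one value per row. The my_func body, the "Value" column, and the helper names process_group and parallel_apply are made up for illustration. It splits the frame with groupby, hands each group to a multiprocessing.Pool worker, and joins the pieces with a single pd.concat (DataFrame.append was removed in pandas 2.0).

```python
import multiprocessing as mp

import pandas as pd


def my_func(group_df):
    # Hypothetical stand-in for the real per-group calculation:
    # deviation of each row's Value from the group mean.
    return group_df["Value"] - group_df["Value"].mean()


def process_group(group_df):
    # Work on a copy so the new column is not written into a slice
    # of the original frame.
    out = group_df.copy()
    out["Result"] = my_func(out)
    return out


def parallel_apply(df, n_workers=4):
    # Split once with groupby instead of filtering the full frame per group.
    groups = [g for _, g in df.groupby("GroupName", sort=False)]
    with mp.Pool(processes=n_workers) as pool:
        parts = pool.map(process_group, groups)
    # Concatenate once at the end rather than appending inside the loop.
    return pd.concat(parts, ignore_index=True)


if __name__ == "__main__":
    df = pd.DataFrame(
        {"GroupName": ["a", "b", "a", "b"], "Value": [1.0, 2.0, 3.0, 4.0]}
    )
    df2 = parallel_apply(df, n_workers=2)
    print(df2)
```

Each group is pickled and sent to a worker process, so this should only pay off when my_func is expensive relative to the cost of copying the group's data between processes.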