0

I'm stuck on speeding up a groupby & apply operation. Here is the code:

dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x:'_'.join(sorted(list(x)))).reset_index()

which takes a long time as the data size grows. I tried rewriting the groupby & apply as a plain for loop, which didn't help; then I tried using unique(), but that also failed to reduce the running time.

I'm looking for an updated version of this code with a shorter run time, and I would really appreciate a solution to this problem.
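For context, here is a tiny self-contained version of that line with invented sample values, showing the intended output (per group, the sorted equip values joined with '_'):

import pandas as pd

# Invented toy data; the real frame has many more rows
dat = pd.DataFrame({
    'glass_id': [1, 1, 2],
    'label':    ['a', 'a', 'b'],
    'step':     [10, 10, 20],
    'equip':    ['E2', 'E1', 'E3'],
})

out = (dat.groupby(['glass_id', 'label', 'step'])['equip']
          .apply(lambda x: '_'.join(sorted(list(x))))
          .reset_index())
print(out)
#    glass_id label  step  equip
# 0         1     a    10  E1_E2
# 1         2     b    20     E3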

Beytab
  • 11
  • 2

1 Answer

0

I think you can consider using multiprocessing. Check the following example:

import multiprocessing
import numpy as np
import pandas as pd

# The function you use in conjunction with multiprocessing;
# it must be defined at module top level so it can be pickled
def loop_many(sub_df):
    grouped_by_KEY_SEQ_and_count = sub_df.groupby(['KEY_SEQ']).agg('count')
    return grouped_by_KEY_SEQ_and_count


# You will use 6 processes (configurable) to process the dataframe in parallel
NUMBER_OF_PROCESSES = 6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)

# Split the dataframe (pre_sale here) into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)

# Process the split sub-dataframes with loop_many() on multiple processes
processed_sub_dataframes = pool.map(loop_many, df_split)

# Close the multiprocessing pool
pool.close()
pool.join()

concatenated_sub_dataframes = pd.concat(processed_sub_dataframes).reset_index()
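One caveat: np.array_split cuts the frame at row positions, so a single group can straddle two chunks and then appear twice after pd.concat. A sketch adapting this idea to the question's aggregation, splitting on glass_id so every ('glass_id', 'label', 'step') group stays in one chunk (the input path is hypothetical, standing in for the question's dat):

import multiprocessing
import numpy as np
import pandas as pd

def join_equip(sub_df):
    # The question's aggregation, applied to one chunk
    return (sub_df.groupby(['glass_id', 'label', 'step'])['equip']
                  .apply(lambda x: '_'.join(sorted(x)))
                  .reset_index())

if __name__ == '__main__':
    NUMBER_OF_PROCESSES = 6
    dat = pd.read_csv('dat.csv')  # hypothetical input file
    # Split on the unique glass_id values so no group straddles two chunks
    ids = dat['glass_id'].unique()
    chunks = [dat[dat['glass_id'].isin(part)]
              for part in np.array_split(ids, NUMBER_OF_PROCESSES)]
    with multiprocessing.Pool(NUMBER_OF_PROCESSES) as pool:
        result = pd.concat(pool.map(join_equip, chunks)).reset_index(drop=True)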

YoungMin Park
  • 1,101
  • 1
  • 10
  • 18
  • It first raises AttributeError: Can't pickle local object loop_many(), and raises a RuntimeError when I take the loop function out of the function it belongs to – Beytab May 25 '21 at 10:03
  • @Beytab Hmm, it might be due to a different execution environment, because I tested the above code in a Jupyter notebook. I guess you're running the code as a Python module. I found some useful links for that error: 1. https://stackoverflow.com/a/58897266/8979023 2. https://stackoverflow.com/a/21345423/8979023 Cheers – YoungMin Park May 25 '21 at 12:08
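For readers who hit the same pickling error discussed in these comments: multiprocessing pickles the worker function by reference, so it must live at module top level (not nested inside another function), and when run as a script the pool must be created under an if __name__ == '__main__': guard. A minimal layout under those assumptions (the CSV path is hypothetical):

import multiprocessing
import numpy as np
import pandas as pd

# Top-level function: picklable, unlike a function defined inside another function
def loop_many(sub_df):
    return sub_df.groupby(['KEY_SEQ']).agg('count')

if __name__ == '__main__':
    pre_sale = pd.read_csv('pre_sale.csv')  # hypothetical input file
    with multiprocessing.Pool(processes=6) as pool:
        parts = pool.map(loop_many, np.array_split(pre_sale, 6))
    result = pd.concat(parts).reset_index()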