
I have a function with 4 nested for loops in it. The function takes in a dataframe and returns a new dataframe. Currently the function takes about 2 hours to run, and I need it to run in around 30 minutes...

I've tried multiprocessing using 4 cores, but I can't seem to get it to work. I start by splitting my input dataframe into a list of smaller chunks, one per trip (list_of_df):

all_trips = uncov_df.TRIP_NO.unique()

list_of_df = []
for trip in all_trips:
    list_of_df.append(uncov_df[uncov_df.TRIP_NO==trip])
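The per-trip split above can also be written with groupby, which scans the frame once instead of once per trip (a sketch, using a toy frame in place of the real uncov_df):

```python
import pandas as pd

# toy stand-in for uncov_df (hypothetical data)
uncov_df = pd.DataFrame({"TRIP_NO": [1, 1, 2, 3], "DIST": [5, 7, 9, 11]})

# one sub-frame per TRIP_NO, equivalent to the boolean-mask loop above
list_of_df = [group for _, group in uncov_df.groupby("TRIP_NO")]
print(len(list_of_df))  # 3
```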

I then tried mapping this list of chunks through my function (transform_df) using a pool of 4 worker processes.

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(4) as p:
        df_uncov = list(p.map(transform_df, list_of_df))
        
df = pd.concat(df_uncov)

When I run the above my code cell freezes and nothing happens. Does anyone know what's going on?

  • Looks about right, are you running out of memory? With four nested for-loops I'd rather look into numba or cython though... – mcsoini Nov 15 '21 at 14:48
  • @mcsoini No problems with memory. Thanks, I'll have a read over numba and cython documentation. – Ramon Santiago Nov 15 '21 at 14:55
  • can you post some of your dataframe that covers a few different trip numbers? just to see if the basic multiprocessing works without going through the full nested function? – Jonathan Leon Nov 15 '21 at 16:45
  • Are you using a Jupyter notebook (_"code cell"_ sounds like it)? If so, look [here](https://stackoverflow.com/q/47313732/14311263). – Timus Nov 15 '21 at 17:53
  • @Timus Thanks the post you linked got it to work! – Ramon Santiago Nov 16 '21 at 09:28

1 Answer


This is how I set mine up. Because transform_df takes a single dataframe argument, Pool.map is the right call here (starmap is for functions that take multiple arguments passed as tuples). This returns a list of dfs to be concatenated later.

#put this above if __name__ == "__main__":
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as p:
        # the with-block closes and joins the pool on exit
        df_list = p.map(transform_df, keys_list)
    return df_list

#then below if __name__ == "__main__":
df_list = get_dflist_multiprocess(list_of_df, num_proc=4) #one result per trip chunk
df_new = pd.concat(df_list, sort=False)
Jonathan Leon