
I have a function with 4 nested for loops in it. The function takes in a dataframe and returns a new dataframe. Currently the function takes about 2 hours to run, and I need it to run in around 30 minutes...

I've tried multiprocessing using 4 cores, but I can't seem to get it to work. I start by splitting my input dataframe into a list of smaller chunks, one per trip (list_of_df):

all_trips = uncov_df.TRIP_NO.unique()

list_of_df = []
for trip in all_trips:
    list_of_df.append(uncov_df[uncov_df.TRIP_NO==trip])
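The per-trip split above can also be written with groupby, which scans the frame once instead of once per trip (a sketch, using a toy frame in place of the real uncov_df):

```python
import pandas as pd

# toy stand-in for uncov_df (hypothetical data)
uncov_df = pd.DataFrame({"TRIP_NO": [1, 1, 2, 3], "DIST": [5, 7, 9, 11]})

# one sub-frame per TRIP_NO, equivalent to the boolean-mask loop above
list_of_df = [group for _, group in uncov_df.groupby("TRIP_NO")]
print(len(list_of_df))  # 3
```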

I then tried mapping this list of chunks through my function (transform_df) using a pool of 4 worker processes.

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(4) as p:
        df_uncov = list(p.map(transform_df, list_of_df))
        
df = pd.concat(df_uncov)

When I run the above my code cell freezes and nothing happens. Does anyone know what's going on?

  • Looks about right, are you running out of memory? With four nested for-loops I'd rather look into numba or cython though... – mcsoini Nov 15 '21 at 14:48
  • @mcsoini No problems with memory. Thanks, I'll have a read over numba and cython documentation. – Ramon Santiago Nov 15 '21 at 14:55
  • can you post some of your dataframe that covers a few different trip numbers? just to see if the basic multiprocessing works without going through the full nested function? – Jonathan Leon Nov 15 '21 at 16:45
  • Are you using a Jupyter notebook (_"code cell"_ sounds like it)? If so, look [here](https://stackoverflow.com/q/47313732/14311263). – Timus Nov 15 '21 at 17:53
  • @Timus Thanks the post you linked got it to work! – Ramon Santiago Nov 16 '21 at 09:28

1 Answer


This is how I set mine up. Because transform_df takes a single dataframe argument, Pool.map is the right call here (starmap is for functions that take multiple arguments passed as tuples). This returns a list of dfs to be concatenated later.

#put this above if __name__ == "__main__":
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as p:
        # the with-block closes and joins the pool on exit
        df_list = p.map(transform_df, keys_list)
    return df_list

#then below if __name__ == "__main__":
df_list = get_dflist_multiprocess(list_of_df, num_proc=4) #one result per trip chunk
df_new = pd.concat(df_list, sort=False)
Jonathan Leon