Below is the function used in my program. I have a huge data frame, and I need to split it into the number of parts given by the parameter n_cores and process those parts in parallel. However, it runs sequentially and takes almost the same time as a sequential run (sometimes longer, presumably because of the multiprocessing overhead). Based on the logs, I can see that the workers are running one after another.
Can someone please suggest what I am doing wrong?
parallelize_dataframe takes the input dataframe, the function to run, and the number of cores as input arguments.
import timeit
from multiprocessing import Pool

import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, n_cores=4):
    start_time = timeit.default_timer()
    # split the dataframe into n_cores chunks and hand one chunk to each worker
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
I am using the above function, which I took from a link.
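For context, this is roughly how I call it. The worker function and dataframe below are simplified placeholders (my real worker parses SQL output and writes XML files); they only illustrate the call pattern:

import pandas as pd

def process_chunk(chunk):
    # placeholder for my real worker, which parses sql output to create xml files
    chunk = chunk.copy()
    chunk["value"] = chunk["value"] * 2
    return chunk

if __name__ == "__main__":
    big_df = pd.DataFrame({"value": range(1_000_000)})  # stand-in for my huge dataframe
    result = parallelize_dataframe(big_df, process_chunk, n_cores=4)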
Below are the messages I see in the logs, which show that each worker ran in sequence:
2020-09-15 01:46:46 INFO TEST: ForkPoolWorker-1 - Started to parse output of sql to create xml files.
2020-09-15 01:47:08 INFO TEST: ForkPoolWorker-1 - Finished parsing output of sql to create xml files.
2020-09-15 01:48:17 INFO TEST: ForkPoolWorker-2 - Started to parse output of sql to create xml files.
2020-09-15 01:48:48 INFO TEST: ForkPoolWorker-2 - Finished parsing output of sql to create xml files.
2020-09-15 01:50:00 INFO TEST: ForkPoolWorker-3 - Started to parse output of sql to create xml files.
2020-09-15 01:50:36 INFO TEST: ForkPoolWorker-3 - Finished parsing output of sql to create xml files.
2020-09-15 01:52:42 INFO TEST: ForkPoolWorker-4 - Started to parse output of sql to create xml files.
2020-09-15 01:53:08 INFO TEST: ForkPoolWorker-4 - Finished parsing output of sql to create xml files.
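For reference, here is a minimal sanity check I plan to try, to see whether the pool itself spreads work across workers (the sleep-based worker and the chunk counts are just placeholders, not my real code):

import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def slow_chunk(chunk):
    time.sleep(5)  # simulate a slow per-chunk task
    return chunk

if __name__ == "__main__":
    df = pd.DataFrame({"x": range(100)})
    parts = np.array_split(df, 4)
    start = time.time()
    with Pool(4) as pool:
        pd.concat(pool.map(slow_chunk, parts))
    print(f"elapsed: {time.time() - start:.1f}s")  # ~5s if parallel, ~20s if sequential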