
Below is the function used in my program. I have a huge data frame, and I need to split it into the number of parts given by the `n_cores` parameter and process the parts in parallel. But the processing runs sequentially and takes almost the same time as sequential execution (sometimes more, because of the multiprocessing overhead). The logs confirm that the workers run one after another.

Can someone please suggest what is wrong?

`parallelize_dataframe` takes the input dataframe, the function to run, and the number of cores as arguments.

import timeit
from multiprocessing import Pool

import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, n_cores=4):
    start_time = timeit.default_timer()
    # Split the frame into n_cores roughly equal pieces
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    # map() sends one piece to each worker and blocks until all pieces return
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
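
For context, here is a minimal sketch of how I call it. `parse_sql_output` below is only a placeholder for my real worker function, and the `__main__` guard keeps child processes from re-executing the script on spawn-based platforms:

def parse_sql_output(chunk):
    # Placeholder for the real work: parse the SQL output in this
    # sub-frame and write the corresponding XML files.
    return chunk

if __name__ == "__main__":
    big_df = pd.DataFrame({"value": range(1_000_000)})
    result = parallelize_dataframe(big_df, parse_sql_output, n_cores=4)

Note that the worker has to be a module-level function so it can be pickled and sent to the pool's worker processes.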

I am using the above function from the link.

Below are the messages I see in the logs, which show that each worker ran in sequence:

2020-09-15 01:46:46 INFO TEST: ForkPoolWorker-1 - Started to parse output of sql to create xml files.
2020-09-15 01:47:08 INFO TEST: ForkPoolWorker-1 - Finished parsing output of sql to create xml files.
2020-09-15 01:48:17 INFO TEST: ForkPoolWorker-2 - Started to parse output of sql to create xml files.
2020-09-15 01:48:48 INFO TEST: ForkPoolWorker-2 - Finished parsing output of sql to create xml files.
2020-09-15 01:50:00 INFO TEST: ForkPoolWorker-3 - Started to parse output of sql to create xml files.
2020-09-15 01:50:36 INFO TEST: ForkPoolWorker-3 - Finished parsing output of sql to create xml files.
2020-09-15 01:52:42 INFO TEST: ForkPoolWorker-4 - Started to parse output of sql to create xml files.
2020-09-15 01:53:08 INFO TEST: ForkPoolWorker-4 - Finished parsing output of sql to create xml files.
    According to [this answer](https://stackoverflow.com/questions/13264435/do-multiprocessing-pools-give-every-process-the-same-number-of-tasks-or-are-the), the `chunksize` used for the `pool.map` looks like it will be computed to be 1 if `df_split` generates only 4 items. You could try explicitly setting `chunksize`. You didn't show `df_split`. There is quite a bit of overhead in creating processes and these are not long running jobs. This may be something better suited for threads. – Booboo Sep 15 '20 at 10:23
  • I tried setting `chunksize` to 1, 4, and 8. I'm still having the same issue. – Shankar Guru Sep 15 '20 at 12:12
  • Later I tried `pandarallel`, which reduced the time taken by around 40%. But I would have preferred plain multiprocessing, as asked in the question, over an external Python package. – Shankar Guru Sep 15 '20 at 12:20
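
Update, following Booboo's suggestion in the comments above: a minimal sketch of a thread-based variant with an explicit `chunksize`. `multiprocessing.pool.ThreadPool` is the standard-library thread counterpart of `Pool` with the same API; the function name here is my own:

from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

def parallelize_dataframe_threads(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    # Same API as multiprocessing.Pool, but no process start-up or
    # pickling overhead; chunksize=1 hands each worker exactly one part.
    with ThreadPool(n_cores) as pool:
        df = pd.concat(pool.map(func, df_split, chunksize=1))
    return df

Because of the GIL, this helps mainly if the worker spends its time on I/O (such as writing the XML files) rather than pure-Python computation.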
