0

I am using pandarallel to apply a function to my pandas dataframe. Everything works as expected, but the first (out of eight) worker is extremely slow:

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
  17.14% ::::::                                   |      170 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      992 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      992 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      992 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      992 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      992 /      992 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      991 /      991 |                                                     
 100.00% :::::::::::::::::::::::::::::::::::::::: |      991 /      991 |       

7/8 workers finish within a minute, the first one takes hours to finish. The code is not stuck and does not produce any errors. This seems stupid as 7/8 workers are doing nothing while the first one is almost having a seizure. Is there a way to even out the progress?

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8, progress_bar=True)
def test_function:
 [function here]
df_temp = (pd.DataFrame({'page': range(1, 5)}))
df = pd.DataFrame(list(df_temp.page.parallel_apply(test_function)))

PS: The code in [squared brackets] is left out to increase readability, it shouldn't be relevant to my issue.

diggi2395
  • 185
  • 8

0 Answers0