I am using pandarallel to apply a function to my pandas DataFrame. Everything works, except that the first of the eight workers is extremely slow:
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
17.14% :::::: | 170 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 992 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 992 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 992 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 992 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 992 / 992 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 991 / 991 |
100.00% :::::::::::::::::::::::::::::::::::::::: | 991 / 991 |
Seven of the eight workers finish within a minute, while the first one takes hours. The code is not stuck and does not produce any errors, but this is very wasteful: seven workers sit idle while the first one does almost all of the work. Is there a way to even out the progress across the workers?
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8, progress_bar=True)
def test_function(page):
    [function here]
df_temp = pd.DataFrame({'page': range(1, 5)})
df = pd.DataFrame(list(df_temp.page.parallel_apply(test_function)))
PS: The code in [square brackets] is left out for readability; it shouldn't be relevant to my issue.
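One idea I have been considering, assuming pandarallel simply splits the input into contiguous chunks per worker (which I am not sure about), is to shuffle the rows before calling parallel_apply so that the slow pages are spread across all workers, and then restore the original order afterwards. A rough, untested sketch, reusing df_temp and test_function from above:
# Shuffle so expensive rows are not concentrated in the first chunk,
# then sort the results back into the original row order.
shuffled = df_temp.sample(frac=1, random_state=0)
results = shuffled.page.parallel_apply(test_function).sort_index()
df = pd.DataFrame(list(results))
Would something like this actually help, or is there a built-in way to balance the chunks?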