I am trying to parallelize a function over a pandas DataFrame, and I am wondering why the parallelized version is so much slower than the single-core solution. I am aware that parallelization has its costs, but I am curious whether the code can be improved so that the parallel version is actually faster.
In my case I have a list of 300,000 user IDs (all strings) and need to check whether each ID is also present in another list containing only 10,000 entries.
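To make the operation concrete, here is a tiny illustration with made-up string IDs (all names here are invented; the real lists are much larger):

```python
# hypothetical stand-ins for the real ID lists
all_ids = [f"user_{i}" for i in range(10)]    # stands in for the 300,000 IDs
selected = {"user_2", "user_5", "user_99"}    # stands in for the 10,000-entry list

# flag each ID with 1 if it appears in the selection, else 0
flags = [1 if uid in selected else 0 for uid in all_ids]
print(flags)  # 1 at positions 2 and 5, 0 elsewhere
```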
Since I cannot share the original code, here is an example with integers that reproduces the same performance problem:
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = np.random.randint(10000, size=10000).tolist()

# single-core version
t1 = time.perf_counter()
df['Is_in_selection_single'] = np.where(np.isin(df['All'], selection), 1, 0).astype('int8')
t2 = time.perf_counter()
print(t2 - t1)

# parallel version: one delayed task per DataFrame element
def add_column(x):
    return np.where(np.isin(x, selection), 1, 0)

df['Is_in_selection_parallel'] = Parallel(n_jobs=4)(delayed(add_column)(x) for x in df['All'].to_list())
t3 = time.perf_counter()
print(t3 - t2)
The printed timings are:
0.0307
53.07
which means the parallel version is roughly 1,700 times slower than the single-core one.
In my real-world example with the user IDs, the single-core version takes 1 minute, but the parallelized version had not finished after 15 minutes...
I need the parallelization because this operation has to run several times, so the final script takes several minutes to run. Thank you for any suggestions!
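One variant I have been experimenting with (only a sketch; splitting into 4 chunks to match n_jobs=4 is my guess at a sensible granularity) dispatches one chunk per worker instead of one task per element, so there are 4 tasks rather than 300,000:

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = np.random.randint(10000, size=10000).tolist()

def check_chunk(chunk):
    # vectorized membership test over a whole chunk at once
    return np.where(np.isin(chunk, selection), 1, 0).astype('int8')

# one chunk per worker instead of one task per element
chunks = np.array_split(df['All'].to_numpy(), 4)
results = Parallel(n_jobs=4)(delayed(check_chunk)(c) for c in chunks)
df['Is_in_selection_parallel'] = np.concatenate(results)
```

but I am not sure whether that is the right way to cut down the per-task overhead.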