My use case is that I want to apply a function to each row of a pandas dataframe. The function to be applied takes some additional parameters besides the row data itself, and it returns more than one value. My current implementation is below:
def func_apply(x, p1, p2, p3):
    # dummy stand-in: just return the row index and the extra parameters
    return x.name, p1, p2, p3
In func_apply the parameters are simply returned back to illustrate the situation. In the actual scenario, this function takes a networkx graph as an additional input, reads transformations from the dataframe row, and performs those transformations on the graph, so func_apply is not as simple as it looks in this example.
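To give a rough idea of the shape of the real function, a placeholder version might look like the following; the graph operation and the way the row is interpreted here are invented purely for illustration and are not my actual code:

import networkx as nx

def func_apply(x, graph, p2, p3):
    # invented placeholder: treat the row values as an edge to add to the graph;
    # the real function reads a transformation description from the row and
    # applies it to the graph, which is much more expensive
    graph.add_edge(x['col1'], x['col2'])
    return x.name, graph.number_of_edges(), p2, p3

Below is the dataframe used in this minimal example: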
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['col1'] = np.random.normal(size=15000)
data['col2'] = np.random.normal(size=15000)
New columns are added to the dataframe with the following line:
data[['col3', 'col4', 'col5', 'col6']] = data.apply(func_apply, axis=1, result_type='expand', p1='a', p2='b', p3='c')
Although this works, with the graph and transformation work that the actual func_apply performs it takes a long time to execute. How can I speed it up by running it in parallel?

I have looked at options such as dask and swifter, but this answer suggests that these tools will not work effectively on string columns. I also tried using multiprocessing and Pool, but I have no idea how to pass the extra arguments to func_apply or how to insert the returned data into the new columns of the dataframe. My rough attempt is sketched below.
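This is roughly the direction I tried with multiprocessing, reconstructed here as a minimal sketch rather than my exact code; it uses functools.partial to bind the extra parameters (in the real case one of them would be the graph). I am not sure whether this is correct, whether it is an efficient way to parallelize the row-wise apply, or whether writing the results back like this is the right approach:

import multiprocessing as mp
from functools import partial

import numpy as np
import pandas as pd

def func_apply(x, p1, p2, p3):
    return x.name, p1, p2, p3

if __name__ == '__main__':
    data = pd.DataFrame()
    data['col1'] = np.random.normal(size=15000)
    data['col2'] = np.random.normal(size=15000)

    # bind the extra parameters so each worker only receives the row
    worker = partial(func_apply, p1='a', p2='b', p3='c')

    with mp.Pool() as pool:
        # send each row to the pool as a Series
        results = pool.map(worker, (row for _, row in data.iterrows()))

    # results is a list of 4-tuples; expand it back into the new columns
    data[['col3', 'col4', 'col5', 'col6']] = pd.DataFrame(
        results, index=data.index, columns=['col3', 'col4', 'col5', 'col6'])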