My use case is that I want to apply a function to each row of a pandas dataframe. The function to be applied takes some additional parameters besides the row data itself, and it returns more than one value. My current implementation is below:
def func_apply(x, p1, p2, p3):
    # dummy stand-in: just return the row index and the extra parameters
    return x.name, p1, p2, p3
In func_apply the parameters are simply returned back to illustrate the situation. In the actual scenario, this function takes a networkx graph as an additional input, reads transformations from the dataframe row, and performs those transformations on the graph, so func_apply is not as simple as it looks in this example.
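To give a rough idea of the shape of the real function, a placeholder version might look like the following; the graph operation and the way the row is interpreted here are invented purely for illustration and are not my actual code:

import networkx as nx

def func_apply(x, graph, p2, p3):
    # invented placeholder: treat the row values as an edge to add to the graph;
    # the real function reads a transformation description from the row and
    # applies it to the graph, which is much more expensive
    graph.add_edge(x['col1'], x['col2'])
    return x.name, graph.number_of_edges(), p2, p3

Below is the dataframe used in this minimal example: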
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['col1'] = np.random.normal(size=15000)
data['col2'] = np.random.normal(size=15000)
New columns are added to the dataframe with the following line:
data[['col3', 'col4', 'col5', 'col6']] = data.apply(func_apply, axis=1, result_type='expand', p1='a', p2='b', p3='c')
Although this works, with the graph and transformation work that the actual func_apply performs it takes a long time to execute. How can I speed it up by running it in parallel?

I have looked at options such as dask and swifter, but this answer suggests that these tools will not work effectively on string columns. I also tried using multiprocessing and Pool, but I have no idea how to pass the extra arguments to func_apply or how to insert the returned data into the new columns of the dataframe. My rough attempt is sketched below.
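This is roughly the direction I tried with multiprocessing, reconstructed here as a minimal sketch rather than my exact code; it uses functools.partial to bind the extra parameters (in the real case one of them would be the graph). I am not sure whether this is correct, whether it is an efficient way to parallelize the row-wise apply, or whether writing the results back like this is the right approach:

import multiprocessing as mp
from functools import partial

import numpy as np
import pandas as pd

def func_apply(x, p1, p2, p3):
    return x.name, p1, p2, p3

if __name__ == '__main__':
    data = pd.DataFrame()
    data['col1'] = np.random.normal(size=15000)
    data['col2'] = np.random.normal(size=15000)

    # bind the extra parameters so each worker only receives the row
    worker = partial(func_apply, p1='a', p2='b', p3='c')

    with mp.Pool() as pool:
        # send each row to the pool as a Series
        results = pool.map(worker, (row for _, row in data.iterrows()))

    # results is a list of 4-tuples; expand it back into the new columns
    data[['col3', 'col4', 'col5', 'col6']] = pd.DataFrame(
        results, index=data.index, columns=['col3', 'col4', 'col5', 'col6'])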