Working of pandas.apply() with functions

Question

I need to add a few calculated columns to a pandas DataFrame. Some of these columns require the values to be passed to specific functions.

I came across some behavior that I did not understand. Consider the following code snippet:

from numpy.random import randn
from pandas import DataFrame

def just_sum(a,b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

df['reg_sum'] = df.a + df.b
# works almost instantly

df['f_sum'] = df.apply(lambda x: just_sum(x.a, x.b), axis = 1)
# takes a little more than 30 seconds

Why is the apply method taking so much time?
Is this the right way to do this? If not, what is?

PS: Somebody suggested using Cython. Will that really affect performance?

score 2 · Answer 1 · edited Jan 26 '18 at 14:43


With axis=1, the apply function doesn't take advantage of vectorization. It calls the function once per row, and each call creates a brand-new Series, so for millions of rows that adds up to a lot of overhead.
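To make the per-row behavior concrete, here is a minimal sketch on a tiny frame. The helper `just_sum_row` and the list `seen_types` are made up for this illustration; they only record what apply passes in on each call.

```python
import numpy as np
import pandas as pd

# Small frame so the demonstration runs quickly
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=list("ab"))

seen_types = []  # hypothetical log of the object type apply passes on each call

def just_sum_row(row):
    # Hypothetical helper: with axis=1, apply hands over one row at a time
    seen_types.append(type(row))
    return row.a + row.b

row_wise = df.apply(just_sum_row, axis=1)   # Python-level call for every row
vectorized = df.a + df.b                    # one operation on whole columns

print(len(seen_types))                                   # number of calls made
print(all(issubclass(t, pd.Series) for t in seen_types)) # each row arrives as a Series
print(row_wise.equals(vectorized))                       # same values either way
```

Both approaches give the same values; the difference is that the first one pays for a Python function call and a Series construction on every row.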

See the discussion in pandas GitHub issue 11615.

The accepted answer in this other Stack Overflow post mentions it as well:

Pandas - Explanation on apply function being slow

answered Jan 26 '18 at 13:12 by Orenshi · edited Jan 26 '18 at 14:43 by Dennis Soemers

score 0 · Accepted Answer · answered Mar 14 '18 at 08:27

Answering the question as there were 2 parts to it.

As @Orenshi said, the apply function doesn't take advantage of vectorization. The right way to do this is to vectorize the function. The snippet in the question can thus be written as:

from numpy.random import randn
from numpy import vectorize
from pandas import DataFrame

def just_sum(a,b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

vector_sum = vectorize(just_sum)

df['f_sum'] = vector_sum(df.a, df.b)
# works almost instantly
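One caveat worth checking on your own machine: the NumPy documentation describes vectorize as provided primarily for convenience rather than performance, since it is essentially a loop over the elements. Because just_sum only uses +, it can also be called on the whole columns directly. The sketch below times both; `timed` is a hypothetical helper written for this comparison, not part of pandas or NumPy, and the timings will vary by machine.

```python
import time
import numpy as np
import pandas as pd

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data, as in the question
df = pd.DataFrame(np.random.randn(1_000_000, 2), columns=list("ab"))

def timed(label, fn):
    # Hypothetical helper: run fn once and print the wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# Element-by-element calls wrapped by numpy.vectorize (returns a NumPy array)
via_vectorize = timed("np.vectorize", lambda: np.vectorize(just_sum)(df.a, df.b))

# just_sum only uses +, so it accepts whole columns directly (returns a Series)
direct = timed("direct call", lambda: just_sum(df.a, df.b))

print(np.allclose(via_vectorize, direct))  # both approaches agree on the values
```

For functions that cannot be expressed with column-wise operations, numpy.vectorize remains a convenient option, just not necessarily the fastest one.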
