I need to add a few calculated columns to a pandas DataFrame. Some of these columns require the values to be passed to specific functions.
I came across some behavior that I do not understand. Consider the following code snippet:
from numpy.random import randn
from pandas import DataFrame

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

df['reg_sum'] = df.a + df.b
# works almost instantly

df['f_sum'] = df.apply(lambda x: just_sum(x.a, x.b), axis=1)
# takes a little more than 30 seconds
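For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two approaches could be timed. It assumes a smaller frame (100,000 rows, chosen only to keep the run short) and uses `time.perf_counter`; the helper name `time_call` is hypothetical and not part of pandas. Exact timings will vary by machine and pandas version.

```python
from time import perf_counter

from numpy.random import randn
from pandas import DataFrame


def just_sum(a, b):
    return a + b


def time_call(func):
    """Hypothetical helper: run func once, return (result, elapsed seconds)."""
    start = perf_counter()
    result = func()
    return result, perf_counter() - start


# Assumption: 100,000 rows instead of 1,000,000 so the row-wise call finishes quickly
df = DataFrame(randn(100_000, 2), columns=list('ab'))

# Whole-column arithmetic
reg_sum, t_columns = time_call(lambda: df.a + df.b)

# Row-by-row call through apply, as in the snippet above
f_sum, t_apply = time_call(
    lambda: df.apply(lambda x: just_sum(x.a, x.b), axis=1)
)

print(f"column arithmetic: {t_columns:.4f} s")
print(f"row-wise apply:    {t_apply:.4f} s")
```

Both calls should produce the same values; only the elapsed time differs.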
- Why does the apply method take so much longer?
- Is this the right way to do this? If not, what is?
PS: Somebody suggested using Cython. Will that really improve performance?