I’ve been working with SAS and SQL for about two decades, and I’m relatively new to Python and pandas. Certain things in Python keep surprising me, and the performance cost of result_type in DataFrame.apply is one of them.
import pandas as pd
from datetime import datetime as dt
# Dataframe
df = pd.DataFrame()
df["A"] = range( 0 , 200000 )
df["B"] = df.A * 3
# Timer
def TimeStamp( X ):
    print( "*** " + X + " - " + dt.strftime( dt.now() , "%Y.%m.%d %H:%M:%S" ))
# Method A -------------------------------------------------------------------
TimeStamp( "Method A - start" )
def Calc1( X ):
    return X.A + X.B
def Calc2( X ):
    return X.B - X.A
df["V"] = df.apply( Calc1 , axis = 1 )
df["W"] = df.apply( Calc2 , axis = 1 )
TimeStamp( "Method A - end" )
# Method B -------------------------------------------------------------------
TimeStamp( "Method B - start" )
def Calc3( X ):
    return X.A + X.B , X.B - X.A
df[["X" , "Y"]] = df.apply( Calc3 , axis = 1 , result_type = "expand" )
TimeStamp( "Method B - end" )
I would have expected method B to be faster than method A, because method B makes one pass over the data while method A makes two. However, method B takes more than twice as long as method A (about 21 seconds versus 9 seconds):
*** Method A - start - 2020.04.20 16:46:33
*** Method A - end - 2020.04.20 16:46:42
*** Method B - start - 2020.04.20 16:46:42
*** Method B - end - 2020.04.20 16:47:03
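The timestamps above only have one-second resolution. A finer-grained comparison could look something like the sketch below. It assumes a much smaller 2,000-row frame to keep the run short, and the helper name timed and the variable names are hypothetical, not part of the original script. It uses time.perf_counter from the standard library and prints whatever timings your machine produces.

```python
import time

import pandas as pd

# Smaller frame than in the question (an assumption, to keep the run short).
small = pd.DataFrame({"A": range(2000)})
small["B"] = small["A"] * 3


def timed(label, func):
    """Run func once, print the elapsed seconds, and return its result."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"*** {label}: {elapsed:.3f} s")
    return result


# Method A: two separate row-wise passes, each returning one scalar per row.
v = timed("Method A, pass 1", lambda: small.apply(lambda r: r.A + r.B, axis=1))
w = timed("Method A, pass 2", lambda: small.apply(lambda r: r.B - r.A, axis=1))

# Method B: one row-wise pass returning a tuple, expanded into two columns.
xy = timed(
    "Method B",
    lambda: small.apply(
        lambda r: (r.A + r.B, r.B - r.A), axis=1, result_type="expand"
    ),
)
```

Timing each apply call separately should make it easier to see whether the extra cost sits in the single expanding pass or elsewhere.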
Can anyone explain to me why that is the case? I assume the problem lies in the “expanding” bit.