
I’ve been working with SAS and SQL for about two decades, and I’m relatively new to Python and pandas. Certain things in Python keep surprising me, and the performance of result_type is one of them.

import pandas as pd
from datetime import datetime as dt

# Dataframe
df = pd.DataFrame()
df["A"] = range( 0 , 200000 )
df["B"] = df.A * 3


# Timer
def TimeStamp( X ):
    print( "*** " + X + " - " + dt.strftime( dt.now() , "%Y.%m.%d %H:%M:%S" ))


# Method A -------------------------------------------------------------------
TimeStamp( "Method A - start" )    

def Calc1( X ):
    return X.A + X.B

def Calc2( X ):
    return X.B - X.A

df["V"] = df.apply( Calc1 , axis = 1 )
df["W"] = df.apply( Calc2 , axis = 1 )

TimeStamp( "Method A - end" )    


# Method B -------------------------------------------------------------------
TimeStamp( "Method B - start" )    

def Calc3( X ):
    return X.A + X.B , X.B - X.A

df[["X" , "Y"]] = df.apply( Calc3 , axis = 1 , result_type = "expand" )

TimeStamp( "Method B - end" )    

I would have expected method B to be faster than method A: method B passes over the rows only once, while method A passes over them twice. However, method B takes more than twice as long as method A.

*** Method A - start - 2020.04.20 16:46:33
*** Method A - end - 2020.04.20 16:46:42
*** Method B - start - 2020.04.20 16:46:42
*** Method B - end - 2020.04.20 16:47:03

Can anyone explain to me why that is the case? I assume the problem lies in the “expanding” bit.
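For comparison, a single pass is also possible without result_type="expand": let apply return plain tuples and unpack them with zip. This is only a sketch (reusing the question's Calc3 on a small frame, with hypothetical column names X and Y); how it compares in speed will vary, and the vectorized approach discussed in the comments below remains faster still.

```python
import pandas as pd

df = pd.DataFrame({"A": range(5)})
df["B"] = df.A * 3

def Calc3(X):
    return X.A + X.B, X.B - X.A

# Without result_type="expand", a tuple return is kept as-is,
# so this yields a Series of (sum, diff) tuples ...
res = df.apply(Calc3, axis=1)

# ... which we unpack into two columns ourselves
df["X"], df["Y"] = zip(*res)
```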

  • `apply` is just slow in general because of its iterative nature (see https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code). You would be much better off calling the function once on the whole frame and then assigning to the dataframe: `c = Calc3(df); df['xx'] = c[0]; df['yy'] = c[1]`, which runs in about 3 ms: `3.26 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` – It_is_Chris Apr 20 '20 at 15:11
  • @Yo_Chris this is both funny and frustrating. All tutorials tell you to use the apply method with axis=1. I’m doing a Python for data science course from a well-known company, and even they teach you to use the apply method. But passing the whole dataframe as an argument to the function is far faster; I would never have thought of doing it that way. I guess that is the biggest problem with Python: the lack of standardization. But many thanks for your answer. – SBurggraaff Apr 20 '20 at 16:27
  • apply should be avoided when possible, especially with large datasets. There is almost always a vectorized alternative. For your example, an even faster approach avoids the function altogether: `df['xx'] = df['A'].add(df['B']); df['yy'] = df['B'].sub(df['A'])`, which runs in `1.41 ms ± 91.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` – It_is_Chris Apr 20 '20 at 16:32
  • @Yo_Chris The only problem I have with that is a lack of readability. `df['xx'] = df.A + df.B` is easier to read than `df['xx'] = df.A.add(df.B)`, so from a maintenance perspective it might be better to keep it simple. But if you want to squeeze out that last bit of performance, then why not ;-) – SBurggraaff Apr 20 '20 at 17:01
  • @Yo_Chris Oh hang on, it’s not as great as I thought. It works in this specific example, but inside the function you end up with whole Series, so you can’t write row-wise conditions like `if X.A > 0:`. That severely limits the usability. That’s a pity :-/ – SBurggraaff Apr 20 '20 at 18:10
  • In that situation you would use boolean indexing with `loc`/`iloc`, or a NumPy solution with `np.where` or `np.select`. Say you want A plus B if A > 0, else B minus A; then you could do `df['xx'] = np.select([df['A'] > 0], [df['A'].add(df['B'])], default=df['B'].sub(df['A']))` – It_is_Chris Apr 20 '20 at 18:14
  • @Yo_Chris Ah, np.where is basically a ternary operator and np.select looks like a switch. Nice! – SBurggraaff Apr 20 '20 at 19:38
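A runnable sketch of the `np.where` / `np.select` pattern from the comments, on a tiny hypothetical frame (column names `xx`/`yy` reused from above), computing A plus B where A > 0 and B minus A otherwise:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [-2, 0, 3], "B": [10, 10, 10]})

# np.where acts like a vectorized ternary:
# take A + B where the condition holds, else B - A
df["xx"] = np.where(df["A"] > 0, df["A"] + df["B"], df["B"] - df["A"])

# np.select generalizes this to several conditions (a switch-like construct):
# conditions and choices are passed as lists, with a default fallback
df["yy"] = np.select([df["A"] > 0], [df["A"] + df["B"]], default=df["B"] - df["A"])
```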
