
I have a very large pandas DataFrame (a few million rows) that I am manipulating. The last column I calculate uses the following code:

df['diff'] = df.apply(lambda row: row.col_a - row.col_b, axis=1)

The code only runs successfully about half the time, and when it does, it takes the better part of an hour. Is there a better way to do this in pandas? I've started doing some research: I looked at this Stack Overflow page (Why is pandas apply lambda slower than loop here?), but it deals with categorical data. I've also read up on vectorized operations, but I haven't found anything that I think will work here. Any help is appreciated.

Joe Wolf

1 Answer


Your row-by-row calculation is more than 5,000 times slower than the vectorized equivalent on a DataFrame of random integers of shape (10000, 4). Avoid the combination of lambda and axis=1 whenever possible, and vectorize instead.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))
%timeit df['E'] = df['A'] - df['B']                      # vectorized column arithmetic
%timeit df['E'] = df.apply(lambda x: x.A - x.B, axis=1)  # row-by-row apply
df

485 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.48 s ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # > 5000x slower
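
For what it's worth, if the columns are plain numeric dtypes, you can often shave off a little more overhead by operating on the underlying NumPy arrays, which skips pandas' index-alignment machinery. A minimal sketch, reusing the example frame above:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))

# Series.to_numpy() exposes the raw NumPy array, so the subtraction runs
# as a single array operation with no per-row Python overhead
df['E'] = df['A'].to_numpy() - df['B'].to_numpy()

The difference versus plain column arithmetic is usually small, but on a frame with millions of rows it can be noticeable.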
David Erickson