
I have a very large pandas DataFrame (a few million rows) that I am manipulating. The last column I calculate uses the following code:

df['diff'] = df.apply(lambda row: row.col_a - row.col_b, axis=1)

The code only runs successfully about half the time, and when it does, it takes the better part of an hour. Is there a better way to do this in pandas? I've started doing some research: I looked at this Stack Overflow page (Why is pandas apply lambda slower than loop here?), but it deals with categorical data. I've also read up on vectorized operations, but I haven't found anything that I think will work here. Any help is appreciated.

Joe Wolf

1 Answer


Your row-by-row calculation is more than 5,000 times slower than the vectorized equivalent on a DataFrame of random integers of shape (10000, 4). Avoid the combination of lambda and axis=1 whenever possible, and vectorize instead.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))
%timeit df['E'] = df['A'] - df['B']                      # vectorized column arithmetic
%timeit df['E'] = df.apply(lambda x: x.A - x.B, axis=1)  # row-by-row apply
df

485 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.48 s ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # > 5000x slower
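
For what it's worth, if the columns are plain numeric dtypes, you can often shave off a little more overhead by operating on the underlying NumPy arrays, which skips pandas' index-alignment machinery. A minimal sketch, reusing the example frame above:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))

# Series.to_numpy() exposes the raw NumPy array, so the subtraction runs
# as a single array operation with no per-row Python overhead
df['E'] = df['A'].to_numpy() - df['B'].to_numpy()

The difference versus plain column arithmetic is usually small, but on a frame with millions of rows it can be noticeable.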
David Erickson