
I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns to create a new column. I know how to do this, but I wanted to know which approach is faster and more efficient. I am reproducing my code here:

import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])

Now here is what I want to achieve

df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']

The alternative is to use the apply functionality of pandas:

df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)

I wanted to know which method takes less time when there are 1.5 million rows and 8 columns to combine.
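For reference, here is one way to time both approaches yourself (a minimal sketch using the standard timeit module on a synthetic 1.5M-row frame; the random data and loop counts are placeholders, and timings will vary by machine):

import timeit
import numpy as np
import pandas as pd

# synthetic frame with the three toy columns from above
df = pd.DataFrame(np.random.rand(1500000, 3), columns=['A', 'B', 'C'])

vectorized = lambda: 0.5*df['A'] + 0.3*df['B'] + 0.2*df['C']
row_apply = lambda: df.apply(lambda row: 0.5*row['A'] + 0.3*row['B'] + 0.2*row['C'], axis=1)

print(timeit.timeit(vectorized, number=10))  # vectorized column arithmetic
print(timeit.timeit(row_apply, number=1))    # row-wise apply, much slower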

piRSquared
NG_21
  • So you are basically asking for someone to benchmark that for you? – David Arenburg Sep 12 '16 at 07:48
  • Yeah, because I didn't know how to do it until now. Now I know that something like timeit exists. – NG_21 Sep 12 '16 at 07:52
  • Well, [you could just Google it](http://stackoverflow.com/questions/1593019/is-there-any-simple-way-to-benchmark-python-script). SO is neither a benchmarking nor a Googling service (although some will answer just about any question). – David Arenburg Sep 12 '16 at 07:56
  • @DavidArenburg – It was not just about benchmarking, if you look at the answers. Experienced people have provided alternative methods of calculation, which are more efficient than the ones I was benchmarking. – NG_21 Sep 12 '16 at 09:11

2 Answers


The first method is faster, because it is vectorized:

df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])
print(df)

# replicate the 3-row frame 10000 times -> [30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)

# vectorized column arithmetic
df['D1'] = 0.5*df['A'] + 0.3*df['B'] + 0.2*df['C']
# similar timings with the mul function
# df['D1'] = df['A'].mul(0.5) + df['B'].mul(0.3) + df['C'].mul(0.2)

# row-wise apply
df['D'] = df.apply(lambda row: 0.5*row['A'] + 0.3*row['B'] + 0.2*row['C'], axis=1)

print(df)

In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop

In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop

In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop

Another test on a 1.5M-row DataFrame; the apply method is very slow:

# rebuild the frame at 1.5M rows -> [1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)

In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop

In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop

In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
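The same vectorized pattern extends directly to the 8-column case in the question; a sketch, with hypothetical column names and weights (only A, B, C exist in the toy frame above):

# placeholder weights per column -- extend the dict to all 8 columns as needed
weights = {'A': 0.5, 'B': 0.3, 'C': 0.2}
df['D3'] = sum(w * df[col] for col, w in weights.items())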
jezrael

Using @jezrael's setup

df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)

It is far more efficient to use a dot product:

np.array([[.5, .3, .2]]).dot(df.values.T).T
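To assign the result back as a new column, an equivalent form computes the matrix-vector product directly (a sketch; the weights array name is mine, and DataFrame.dot aligns on column labels):

weights = np.array([.5, .3, .2])
df['D'] = df[['A', 'B', 'C']].values.dot(weights)
# or, aligning by label rather than by position:
df['D'] = df[['A', 'B', 'C']].dot(pd.Series(weights, index=['A', 'B', 'C']))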

Timing

(timing screenshot omitted)

piRSquared