As far as I know, using .apply() in pandas is rather inefficient because it isn't vectorized. I have a bunch of fairly ordinary operations, like addition or multiplication, that I want to perform differently depending on the contents of certain columns.

My central question: what are the advantages and disadvantages of the two code snippets below?

df['col'] = df['col'].apply(lambda x: x/df['col'].max() if x < 1000 else x)

# or 

df.loc[df['col']<1000,'col'] = df["col"]/df['col'].max()

I've noticed that the first is slower, but I've seen it recommended a lot, and I sometimes get slice errors with the second version, so I was hesitant to use it.

1 Answer

When you use loc to assign to a subset on the left-hand side, you should also subset the right-hand side so the assignment is explicit. This avoids errors in cases where the index might contain duplicates.

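import pandas as pd
df = pd.DataFrame({'col': range(997,1003)})
# Cast to float first: newer pandas versions warn about (or reject)
# assigning float results into an int64 column
df['col'] = df['col'].astype(float)

m = df['col'].lt(1000)  # boolean mask of the rows to rescale
df.loc[m, 'col'] = df.loc[m, 'col']/df['col'].max()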
#           col
#0     0.995010
#1     0.996008
#2     0.997006
#3  1000.000000
#4  1001.000000
#5  1002.000000

Alternatively, use np.where for an if-else clause:

import numpy as np

df = pd.DataFrame({'col': range(997,1003)})
df['col'] = np.where(df['col'].lt(1000), df['col']/df['col'].max(), df['col'])
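As a quick sanity check, here is a minimal sketch showing that the element-wise apply, the masked loc assignment, and np.where all give the same values on this toy frame. The names col_max, via_apply, via_loc and via_where are just illustrative, and the float cast is an assumption made so that the masked assignment does not rely on implicit upcasting of an integer column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(997, 1003)})
col_max = df['col'].max()   # computed once, outside any per-row function
m = df['col'].lt(1000)      # True for the rows that should be rescaled

# 1) element-wise apply, with the max hoisted out of the lambda
via_apply = df['col'].apply(lambda x: x / col_max if x < 1000 else x)

# 2) masked assignment on a float copy, subsetting both sides
via_loc = df.astype(float)
via_loc.loc[m, 'col'] = via_loc.loc[m, 'col'] / col_max

# 3) np.where as a vectorized if-else
via_where = np.where(m, df['col'] / col_max, df['col'])

same = bool(np.allclose(via_apply, via_loc['col'])
            and np.allclose(via_apply, via_where))
print(same)
```

Note that np.where returns a NumPy array rather than a Series, which is fine when the result is assigned straight back to a column as above.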

As for using apply, this question has much more thorough answers; in particular, see @jpp's answer. You may have seen .apply suggested for a groupby object, or for column-wise calculations on a narrow DataFrame, which are typically fine.
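If you want to measure the difference on your own data, a rough timing sketch might look like the following. The frame big, its size, and the helper names with_apply and with_where are assumptions made purely for illustration, and the timings will vary by machine and pandas version. Also note that the lambda in the question calls df['col'].max() once per row, so the max is computed once up front here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 floats between 0 and 2000
rng = np.random.default_rng(0)
big = pd.DataFrame({'col': rng.integers(0, 2000, size=100_000).astype(float)})
col_max = big['col'].max()  # hoisted so it is not recomputed per element

def with_apply():
    # Python-level function call for every element
    return big['col'].apply(lambda x: x / col_max if x < 1000 else x)

def with_where():
    # Whole-column operations, no per-element Python call
    return np.where(big['col'].lt(1000), big['col'] / col_max, big['col'])

t_apply = timeit.timeit(with_apply, number=3)
t_where = timeit.timeit(with_where, number=3)
print(f"apply: {t_apply:.4f}s  np.where: {t_where:.4f}s")
```

Both helpers return the same values, so whichever is faster on your data can be swapped in without changing the result.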

ALollz