A nested np.where should be the fastest approach:
np.where(df.A == 'a', 'a',
         np.where((df.A == 'b') & (df.B.isin(['B','C'])), 'A',
                  np.where(df.C == 'c', 'c', np.nan)))
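To see how the conditions fall through, here is a minimal sketch on a tiny made-up frame (df_small and its values are hypothetical, chosen so that each row hits a different branch). It uses an empty string as the fallback instead of np.nan, on the assumption that a pure-string result is wanted:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame: one row per branch of the logic
df_small = pd.DataFrame({
    'A': ['a', 'b', 'b', 'x'],
    'B': ['z', 'B', 'z', 'z'],
    'C': ['z', 'z', 'c', 'z'],
})

# Conditions are checked outermost first; the first match wins
result = np.where(df_small.A == 'a', 'a',
                  np.where((df_small.A == 'b') & (df_small.B.isin(['B', 'C'])), 'A',
                           np.where(df_small.C == 'c', 'c', '')))

print(result.tolist())  # ['a', 'A', 'c', '']
```

The result is a plain NumPy array, so it can be assigned straight to a new column, e.g. df_small['D'] = result.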
Speed Test
import numpy as np
import pandas as pd

# create 100,000 rows of random data
df = pd.DataFrame({'A': np.random.choice(['a','b','c','A','B','C'], 100000, True),
                   'B': np.random.choice(['a','b','c','A','B','C'], 100000, True),
                   'C': np.random.choice(['a','b','c','A','B','C'], 100000, True)})
%%timeit
np.where(df.A == 'a', 'a',
         np.where((df.A == 'b') & (df.B.isin(['B','C'])), 'A',
                  np.where(df.C == 'c', 'c', np.nan)))
10 loops, best of 3: 33.4 ms per loop
def my_logic(x):
    if x[0] == 'a':
        return 'a'
    elif x[0] == 'b' and x[1] in ('B', 'C'):
        return 'A'
    elif x[2] == 'c':
        return 'c'
    return ''
%%timeit
df[['A', 'B', 'C']].apply(my_logic, axis=1)
1 loops, best of 3: 5.87 s per loop
On this data, the nested np.where is roughly 175 times faster than apply (33.4 ms vs. 5.87 s), which should be the method of last resort.
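Outside IPython, the comparison can be sketched with the standard timeit module. This is only an illustration under assumptions: the names df_check, nested_where and row_logic are made up here, the frame is kept small (1,000 rows) so it runs quickly, both versions use '' as the fallback so their outputs can be compared directly, and the row function indexes by column label rather than by position. Timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Small random frame (hypothetical stand-in for the 100,000-row test data)
rng = np.random.default_rng(0)
letters = ['a', 'b', 'c', 'A', 'B', 'C']
df_check = pd.DataFrame({col: rng.choice(letters, 1000) for col in 'ABC'})


def nested_where(frame):
    # Vectorized version: each condition is evaluated on whole columns
    return np.where(frame.A == 'a', 'a',
                    np.where((frame.A == 'b') & frame.B.isin(['B', 'C']), 'A',
                             np.where(frame.C == 'c', 'c', '')))


def row_logic(row):
    # Row-wise version of the same rules, called once per row by apply
    if row['A'] == 'a':
        return 'a'
    elif row['A'] == 'b' and row['B'] in ('B', 'C'):
        return 'A'
    elif row['C'] == 'c':
        return 'c'
    return ''


vectorized = nested_where(df_check)
applied = df_check.apply(row_logic, axis=1)

# Both approaches should label every row identically
same = list(vectorized) == applied.tolist()
print('identical results:', same)

# Average seconds per call over a few repetitions
t_where = timeit.timeit(lambda: nested_where(df_check), number=5) / 5
t_apply = timeit.timeit(lambda: df_check.apply(row_logic, axis=1), number=5) / 5
print(f'np.where: {t_where:.6f} s, apply: {t_apply:.6f} s')
```

Checking that the two outputs match before timing them guards against comparing the speed of two functions that do not actually compute the same thing.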