
My script selects specific rows from a DataFrame in two ways:

1.

df2 = df1[(df1['column_x']=='some_value')]

2.

df2 = df1.loc[df1['column_x'].isin(['some_value'])]

From an efficiency perspective and from a Pythonic perspective (that is, the most idiomatic Python way of coding), which method of selecting specific rows is preferred?

P.S. I suspect there are even more ways to achieve the same thing. P.P.S. I feel this question has already been asked, but I couldn't find it. Please link the original if this is a duplicate.

J.A.Cado
  • Well, you are technically selecting `rows` with this method, since you are masking with a Boolean series. – ALollz Nov 29 '18 at 15:32
  • I'm not sure if this helps, but there is a fairly good explanation of row-selection methods here: [https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas] – Pedro Martins de Souza Nov 29 '18 at 15:33
  • The marked duplicate has answers which focus on performance. – jpp Nov 29 '18 at 15:45

1 Answer


They are different. `df1[(df1['column_x']=='some_value')]` is fine if you're just looking for a single value. The advantage of `isin` is that you can pass it multiple values. For example: `df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]`
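As a minimal sketch of that difference (the frame and its values are made up for illustration and are not from the question), both forms select the same rows for a single value, while matching several values with `==` means OR-ing masks together:

```python
import pandas as pd

# Hypothetical example data, not taken from the question
df1 = pd.DataFrame(
    {'column_x': ['some_value', 'another_value', 'other', 'some_value']}
)

# Single value: both boolean masks select the same rows
by_eq = df1[df1['column_x'] == 'some_value']
by_isin = df1.loc[df1['column_x'].isin(['some_value'])]

# Several values: isin takes a list directly,
# whereas == needs one mask per value combined with |
multi_isin = df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
multi_eq = df1[
    (df1['column_x'] == 'some_value') | (df1['column_x'] == 'another_value')
]
```

For a single value the choice is mostly stylistic; `isin` tends to read better once the list of accepted values grows.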

Interestingly, in a quick test on a 10,000-row frame, the first method (using `==`) was roughly twice as slow as the second (using `isin`):

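import timeit

import numpy as np
import pandas as pd

# 10,000 rows drawn at random from three string values
df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    # Boolean mask built with ==
    return df[df['x'] == 'b']

def method2(df=df):
    # Boolean mask built with isin
    return df[df['x'].isin(['b'])]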

>>> timeit.timeit(method1,number=1000)/1000
0.001710233046906069
>>> timeit.timeit(method2,number=1000)/1000
0.0008507879299577325
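A third variant worth timing alongside these builds the mask from the underlying array (`.values`) instead of the Series. The sketch below is only a rough harness: the function names are placeholders chosen for illustration, and the absolute numbers (and possibly the ranking) will vary with machine, data, and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def with_eq(df=df):
    # Mask from a Series comparison
    return df[df['x'] == 'b']

def with_isin(df=df):
    # Mask from Series.isin
    return df[df['x'].isin(['b'])]

def with_values(df=df):
    # Mask from comparing the underlying array rather than the Series
    return df[df['x'].values == 'b']

# Report the average time per call for each approach
for fn in (with_eq, with_isin, with_values):
    per_call = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {per_call:.6f} s per call")
```

All three return the same rows, so the choice comes down to readability and whatever the timings show on your own data.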
sacuL
  • This must be an issue with the `Series` versus `ndarray`. I think `df[df['x'].values == 'b']` should outperform both. – ALollz Nov 29 '18 at 15:40
  • That's true, it's often faster to work with the underlying arrays. In this case, the timing goes down to about 0.0006 seconds, so a slight improvement (less than I thought it would be, TBH). – sacuL Nov 29 '18 at 15:43