1

Can someone please help me understand, while subsetting a dataframe, we generally do

df.loc[df['col_1']==0]

But when I have to subset based on two column values or multiple criteria, we do this instead

df[(df['col_1']==0) & (df['col_2']>0)]

Why is .loc not used in 2nd criteria?

Also, why can't we use and in the second code, i.e.

df[(df['col_1']==0) and (df['col_2']>0)] ?
lightyagami96
  • 336
  • 1
  • 4
  • 14

1 Answers1

1

Why is .loc not used in 2nd criteria?

df = pd.DataFrame({

         'col_1':[0,3,0,7,1,0],
         'col_2':[0,3,6,9,2,4],
         'col3':list('aaabbb')
})

No, you are wrong, it working in both.

print (df.loc[df['col_1']==0])
   col_1  col_2 col3
0      0      0    a
2      0      6    a
5      0      4    b

print (df.loc[(df['col_1']==0) & (df['col_2']>0)])
   col_1  col_2 col3
2      0      6    a
5      0      4    b

print (df[df['col_1']==0])
   col_1  col_2 col3
0      0      0    a
2      0      6    a
5      0      4    b

print (df[(df['col_1']==0) & (df['col_2']>0)])
   col_1  col_2 col3
2      0      6    a
5      0      4    b

Reason for using is if need also filter columns names, e.g. col_1:

print (df.loc[df['col_1']==0, 'col_2'])
0    0
2    6
5    4
Name: col_2, dtype: int64


print (df.loc[(df['col_1']==0) & (df['col_2']>0), 'col_2'])
2    6
5    4
Name: col_2, dtype: int64

If need filter 2 or more columns use list, e.g for col_1,col3 use:

print (df.loc[df['col_1']==0, ['col_1','col3']])
   col_1 col3
0      0    a
2      0    a
5      0    b

print (df.loc[(df['col_1']==0) & (df['col_2']>0), ['col_1','col3']])
   col_1 col3
2      0    a
5      0    b

If omit loc it failed:

df[df['col_1']==0, 'col_1']
df[(df['col_1']==0) & (df['col_2']>0), 'col_1']

TypeError


Also, why can't we use and in the second code, i.e.

df[(df['col_1']==0) and (df['col_2']>0)]

becasue and is for processing by scalars, in pandas are used & for bitwise AND - &. More info is here.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252