
All the research I've done points to using loc as the way to filter a DataFrame by a column's value(s). Today I was reading this, and from the examples I tested I discovered that loc isn't actually needed when filtering a column by its values:

EX:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5), columns=['a', 'b', 'c', 'd', 'e'])

df.loc[df['a'] >= 15]

      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5

df[df['a'] >= 15]

      a     b     c     d     e
6  15.0  15.5  16.0  16.5  17.0
7  17.5  18.0  18.5  19.0  19.5

Note: I do know that loc and iloc return rows by label and by position, respectively. I'm not comparing based on that functionality.
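To make that label/position distinction concrete, here is a minimal sketch (the frame `df2` with a shuffled index is my own illustration, not from the question):

```python
import pandas as pd

# A non-default index makes the difference between label and position visible.
df2 = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])

df2.loc[0]   # row whose *label* is 0  -> a = 20
df2.iloc[0]  # row at *position* 0     -> a = 10
```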

But when filtering with "where"-style clauses, what's the difference, if any, between using loc and not using it? And why do all the examples I come across on this subject use loc?

Miguel
  • 3
    In this case, you're right. For simple filtering there is no difference between passing your boolean array to `df.loc[]` or directly to `df[]`. The power of `.loc[]` comes from more complex look-ups, when you want specific rows and columns. Its syntax is also more flexible, generalized, and less error-prone than chaining together multiple boolean conditions. Overall it makes for more robust accessing/filtering of data in your df. – cvonsteg Nov 14 '18 at 10:10
  • 2
    @cvonsteg, As an extension, would you say `df[df.columns[::-1]]` would be the same (i.e. just syntactic sugar or O(1) performance differential) as `df.iloc[:, ::-1]`? Because that [doesn't seem to be the case](https://stackoverflow.com/questions/51486063/what-is-the-big-o-complexity-of-reversing-the-order-of-columns-in-pandas-datafra). Personally, I find it confusing there doesn't seem to be any official docs on what `__getitem__` does and when/how. – jpp Nov 14 '18 at 10:53
  • 1
    @jpp full disclaimer - I have not looked into this, so the following is just speculation. For large data sets, I imagine you may see divergence between the two approaches, favoring .iloc[]. This is solely based on the notion that df[df.columns[]] performs multiple operations on the df (creating Index object, then `__getitem__`) whilst .iloc[] is probably an optimized approach to this (maybe invoking generators, something along those lines?). But you're right the [Data Model](http://farmdev.com/src/secrets/magicmethod/index.html#introducing-getitem) docs are pretty sparse. – cvonsteg Nov 14 '18 at 15:59

1 Answer


As per the docs, loc accepts a boolean array for selecting rows, and in your case

>>> df['a'] >= 15
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: a, dtype: bool

is treated as a boolean array.
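You can verify that both spellings produce the same frame with the question's own data (a quick sketch, using `.equals()` for an exact comparison):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])
mask = df['a'] >= 15

# The same boolean Series, passed to loc and to plain [] -- identical results.
assert df.loc[mask].equals(df[mask])
```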

The fact that you can omit loc here and issue df[df['a'] >= 15] is a special case convenience according to Wes McKinney, the author of pandas.

Quoting directly from his book, Python for Data Analysis, p. 144, df[val] is used to...

Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
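Those special-case conveniences, plus the row-and-column selection that only `loc` offers, can be sketched with the same frame (my own illustration of the quoted cases, not an example from the book):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 20, 0.5).reshape(8, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

df[['a', 'c']]     # sequence of columns
df[df['a'] >= 15]  # boolean array: filter rows
df[2:4]            # slice: rows 2 and 3
df[df < 5] = 0     # boolean DataFrame: set values based on a criterion

# What plain [] cannot express in one step: rows *and* columns together.
df.loc[df['a'] >= 15, ['a', 'e']]
```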

timgeb
  • 2
    `pd.DataFrame.__getitem__` and `pd.Series.__getitem__` don't seem to be documented for Boolean indexing. Of course, many cases of `df[col_name]` exist in the docs. So possibly we shouldn't even rely on row-wise indexing and always use `loc`? – jpp Nov 14 '18 at 10:23
  • 2
    @jpp It would probably be a bit more explicit to use `loc`. Personally, I would like to stick to the convenience feature. I doubt the devs will take it away at this point. – timgeb Nov 14 '18 at 10:42
  • I take your point. But I would also say performance is *not* always identical in the general case. As an example `df[df.columns[::-1]]` can perform much worse than `df.iloc[:, ::-1]`. I wish there was more advice on `__getitem__` as it is (as you indicate) pretty fundamental to how people use Pandas. – jpp Nov 14 '18 at 10:43