2

I have more experience with SQL than with Python, and I am now starting to use Python more. I've read the pandas "Comparison with SQL" guide.

Groupby is easy for me to understand: `groupby('colname')`.

However, why do we need to write the name of the frame twice for a select, as in the example `frame[frame['col1'].notna()]`? I could not find the reason via web search.

Alex Martian
  • This is called boolean masking, and is a way to select subsets of your data. See [the indexing docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) – G. Anderson Sep 05 '19 at 15:56
  • I suggest you run it step by step. `DataFrame.notna()` creates logical arrays(df) that you then use for subsetting. – NelsonGon Sep 05 '19 at 15:59
  • It is a python convention for numpy and pandas (which is built on numpy), as in the link I posted it's also called boolean indexing, and you can also use the pandas `mask()` function to achieve the same result – G. Anderson Sep 05 '19 at 16:01
  • @NelsonGon, I did that, I just could not get how an array of `True` values could be useful for `__getitem__`, as there were no such (`True`) items in my frame. With G. Anderson's remarks I see it now. – Alex Martian Sep 05 '19 at 16:08

2 Answers

1

Just summarizing helpful comments:

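This is called boolean masking (or boolean indexing), and it is a way to select subsets of your data. It is a convention shared by numpy and pandas (which is built on numpy): the inner expression `frame['col1'].notna()` produces a Series of `True`/`False` values, one per row, and the outer `frame[...]` keeps the rows where that value is `True`. The pandas `mask()` method is related, but it keeps the frame's shape and replaces the masked values with NaN rather than dropping rows.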
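To make the two steps visible, here is a minimal sketch that splits the one-liner from the question into its parts. The frame contents and the variable names `mask`, `selected` and `one_liner` are made up for illustration; only `col1` comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name `col1` is from the question.
frame = pd.DataFrame({"col1": [1.0, np.nan, 3.0], "col2": ["a", "b", "c"]})

# Step 1: the inner expression builds a boolean Series, one True/False per row.
mask = frame["col1"].notna()

# Step 2: the outer frame[...] keeps only the rows where the mask is True.
selected = frame[mask]

# The same selection in one line, which is why `frame` appears twice:
# once to build the mask, once to apply it.
one_liner = frame[frame["col1"].notna()]
```

In SQL terms, the mask plays roughly the role of the `WHERE col1 IS NOT NULL` clause, and indexing the frame with it plays the role of `SELECT * FROM frame`.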

Alex Martian
  • Also related: the use of bitwise `&` and `|` rather than Python's `and` and `or` in pandas, see https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas – Alex Martian Sep 06 '19 at 08:06
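As a rough illustration of that comment, the sketch below combines two conditions with `&` and `|`. The data and the names `both`, `either` and `and_error` are hypothetical. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
frame = pd.DataFrame({"col1": [1.0, np.nan, 3.0, 4.0], "col2": [10, 20, 30, 40]})

# Element-wise AND: rows where col1 is present and col2 is greater than 10.
both = frame[frame["col1"].notna() & (frame["col2"] > 10)]

# Element-wise OR: rows where col1 is missing or col2 is greater than 30.
either = frame[frame["col1"].isna() | (frame["col2"] > 30)]

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses to give, so this raises ValueError.
and_error = None
try:
    frame[frame["col1"].notna() and (frame["col2"] > 10)]
except ValueError as err:
    and_error = err
```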
0

Just to add: nowadays you can use the `query` method to get a somewhat more natural, SQL-like syntax; see e.g. "Querying for NaN and other names in Pandas".
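A small sketch of what that can look like for the `notna()` example from the question, using made-up data. Passing `engine="python"` is a cautious choice here, since method calls inside the query string may not be supported by every evaluation engine; the `col1 == col1` form relies on NaN never being equal to itself.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name `col1` is from the question.
frame = pd.DataFrame({"col1": [1.0, np.nan, 3.0], "col2": ["a", "b", "c"]})

# Boolean indexing, as in the question.
by_mask = frame[frame["col1"].notna()]

# query() with a method call inside the expression string.
by_query = frame.query("col1.notna()", engine="python")

# Alternative that avoids the method call: NaN != NaN, so this drops missing rows.
by_self_compare = frame.query("col1 == col1")
```

The frame is still named once as the object `query` is called on, but the column is referenced by bare name inside the string, which reads closer to a SQL `WHERE` clause.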

Ben Farmer