1

I'm trying to find a very good, comprehensive answer to the above question.

Let's take the following dataframe:

df = pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D'])
df['Category'] = pd.Series(['Cat1', 'Cat2', 'Cat1', 'Cat2'])
df:
            A         B         C         D Category
  0  0.057456  0.462712 -1.245539  0.095947     Cat1
  1  0.741688  0.353600  1.130754 -1.692440     Cat2
  2  0.638623 -0.109566  0.118509 -0.692661     Cat1
  3 -0.905806  0.284818 -0.094696 -1.325617     Cat2

I understand the basic nomenclature of using df.A or df['A'] to select column A, and df[['A', 'B']] to select both columns A and B.

And if I want to pass in boolean conditions I can do things like df[df.A > 0] to return rows where df.A is greater than 0. And I can include multiple conditions like df[(df.A > 0) & (df.Category == 'Cat1')] to include multiple conditions.

One thing I am not seeing though, is how I could select a subset of the dataframe columns with the same two boolean conditions used above.

Two approaches I have tried without success:

df['B'].where(df.A > 0)

This returns all of the rows in the index, whereas I'd only like the ones that return True for said condition.

I've also tried using query without success because it can't be used on a Series.

So df['A'].query(some condition) returns an error message.

Ideally the solution I'm looking for has the following conditions:

  1. Can be used both on single columns or multiple columns. Ie, you can choose both df['A'] and df['B'] with it.
  2. Can use multiple boolean conditions if you want to, not just one.
  3. Only returns the rows that evaluate to True for the condition, not all the rows.
  4. Ideally is syntactically simple, but if that's not possible then so be it.

Thank you.

Jonathan Bechtel
  • 3,497
  • 4
  • 43
  • 73

3 Answers3

5

I think you are looking for .loc

df.loc[df.A>0,'B']
0    0.462712
1    0.353600
2   -0.109566
Name: B, dtype: float64
df.loc[(df.A>0)&(df.B<0),['B','C']]
          B         C
2 -0.109566  0.118509
BENY
  • 317,841
  • 20
  • 164
  • 234
2

You are proceeding correctly with your df['B'].where(df.A > 0) condition. However, just append few more function after it to achieve only True results.

An example with multiple (can pass single too) condition:

df[['B','C']].where((df.A > 1) & (df.D>0)).dropna().reset_index(drop=True)
meW
  • 3,832
  • 7
  • 27
2

Just a variation of @W-B's answer using query:

df.query('A > 0 and B < 0').loc[:, ['B', 'C']]

or

df.query('A > 0 and B < 0')[['B', 'C']]

will also print

          B         C
2 -0.109566  0.118509
Cleb
  • 25,102
  • 20
  • 116
  • 151
  • 1
    @Datanovice: What do you mean by "only work on the index"? I looked at the four requirements listed in the question, and think that `query` addresses all of them. It also [works fine on different index levels](https://stackoverflow.com/a/49045951/1534017). – Cleb Jan 01 '19 at 22:55
  • Apologies, I should have consulted the docs first. I was thinking of the. .filter function. Thanks for teaching me something new. – Umar.H Jan 01 '19 at 22:58
  • @Datanovice: No worries :) – Cleb Jan 01 '19 at 22:59