8

This question is closely related to these two other questions (another and this one), and I'll even use the example from the very helpful accepted solution on that question. Here's the example from the accepted solution (credit to unutbu):

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

But I want to keep all values of A and select only the rows where B has 'two' in it. My attempt is to try

print(df.loc[df['A'] & df['B'] == 'two'])

This does not work, unfortunately. Can anybody suggest a way to implement something like this? It would be a great help if the solution were somewhat general: for example, where column A doesn't only have the value 'foo' but has different values, and you still want the whole column.

  • If you don't want to filter on A, then simply don't include any condition on this column. – jpp Feb 25 '18 at 23:38

3 Answers

7

Easy, if you do

     df[['A','B']][df['B']=='two']

you will get:

     A    B
2  foo  two
4  foo  two
5  bar  two

To filter on both A and B:

    df[['A','B']][(df['B']=='two') & (df['A']=='foo')]

You get:

     A    B
2  foo  two
4  foo  two

and if you want all the columns:

    df[df['B']=='two']

you will get:

     A    B  C   D
2  foo  two  2   4
4  foo  two  4   8
5  bar  two  5  10
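Equivalently, a single `.loc` call can do the row filter and the column selection in one step, which avoids chained indexing; a minimal sketch using the question's example DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# One .loc call takes a boolean row mask and a column list together
result = df.loc[df['B'] == 'two', ['A', 'B']]
print(result)
#      A    B
# 2  foo  two
# 4  foo  two
# 5  bar  two
```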
Humi
4

I think I understand your modified question. After sub-selecting on a condition of B, then you can select the columns you want, such as:

In [1]: df.loc[df.B =='two'][['A', 'B']]
Out[1]: 
     A    B
2  foo  two
4  foo  two
5  bar  two

For example, if I wanted to concatenate all the strings of column A for which column B had the value 'two', then I could do:

In [2]: df.loc[df.B =='two'].A.sum()  # <-- use .mean() for your quarterly data
Out[2]: 'foofoobar'

You could also group by the values of column B and get such a concatenation result for every B-group in one expression:

In [3]: df.groupby('B').apply(lambda x: x.A.sum())
Out[3]: 
B
one      foobarfoo
three       barfoo
two      foofoobar
dtype: object

To filter on A and B use numpy.logical_and:

In [1]: df.loc[np.logical_and(df.A == 'foo', df.B == 'two')]
Out[1]: 
     A    B  C  D
2  foo  two  2  4
4  foo  two  4  8
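For the quarterly-mean use case raised in the comments below, the same filter-then-aggregate idea applies; this is a hypothetical sketch where the column names `quarter` and `midweekGains` and the sample values are assumptions, not the asker's actual data:

```python
import pandas as pd

# Hypothetical movie data; 'quarter' and 'midweekGains' are assumed names
movies = pd.DataFrame({'quarter': [1, 1, 2, 3, 4, 1],
                       'midweekGains': [100.0, 200.0, 50.0, 80.0, 120.0, 300.0]})

# Mean midweekGains for quarter 1 only
q1_mean = movies.loc[movies['quarter'] == 1, 'midweekGains'].mean()

# Or the mean for every quarter in one expression
per_quarter = movies.groupby('quarter')['midweekGains'].mean()
```

Here `q1_mean` averages only the rows where `quarter == 1`, while the `groupby` version produces one mean per quarter without filtering first.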
ely
  • Sorry, but I edited what I asked. I want to have the whole column A, not only 'foo', so in this case the row with index 5 should also be included. It should not assume that I know what the column contains; I just want all of it. –  Feb 25 '18 at 23:32
  • @AbdulMalekAltawekji Your question is not clear. What do you mean that you want "the whole column of A", but "only rows of B that have 'two' in them"? Those two conditions don't seem compatible. – ely Feb 25 '18 at 23:33
  • Sorry for the lack of clarity. I meant I want not only 'foo' in column A; if column A also had a 'boo' and a 'gee' with the value 'two' in column B, I would like them displayed too, without assuming that I know column A has a 'boo' and a 'gee'. So as long as it's in A and has a 'two' in B, I want it. –  Feb 25 '18 at 23:36
  • So do you mean you just want to filter on `B`, but then choose columns A and B after filtering? I'll add an update. – ely Feb 25 '18 at 23:38
  • Maybe it's easier to explain what I'm personally trying to achieve. I have a huge dataset with information about movies. One column, midweekGains, has numerical values about the movies' gains during midweek. Another column, 'quarter', indicates which quarter of the year the film was published and has four values: 1, 2, 3 and 4. I'm trying to calculate the mean of the midweekGains of the films in quarter 1, for instance. –  Feb 25 '18 at 23:39
  • Yes, then you don't care about filtering on `midweekGain` at all. You *only* care about filtering on `quarter` and then doing regular averaging afterwards. In other words, you don't need anything like `df['A']` in your question. You *only* need `df['quarter'] == 1`. – ely Feb 25 '18 at 23:41
  • Oh, sorry for not noticing that. As someone who just started with this, I realize I didn't understand it correctly. Thanks for pointing that out! –  Feb 25 '18 at 23:42
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165796/discussion-between-abdul-malek-altawekji-and-ely). –  Feb 25 '18 at 23:45
3

Row subsetting: isn't this what you are looking for?

df.loc[(df['A'] == 'foo') & (df['B'] == 'two')]

     A    B  C  D
2  foo  two  2  4
4  foo  two  4  8

You can also add .reset_index() at the end to reset the index so it starts again from zero.
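For example, a minimal sketch of resetting the index after the filter, assuming the same example DataFrame from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# drop=True discards the old index instead of keeping it as a new column
out = df.loc[(df['A'] == 'foo') & (df['B'] == 'two')].reset_index(drop=True)
print(out)
#      A    B  C  D
# 0  foo  two  2  4
# 1  foo  two  4  8
```

Without `drop=True`, the original index (2, 4) would be preserved as an extra column named `index`.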

YOLO