For a pandas DataFrame with groups, I want to keep all rows of each group up to and including the first occurrence of a specific value (and discard all rows after it).

MWE:

import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'tmp'],
                   'B' : [0, 1, 0, 0, 0, 1, 0],
                   'C' : [2.0, 5., 8., 1., 2., 9., 7.]})

gives

    A    B  C
0   foo  0  2.0
1   foo  1  5.0
2   foo  0  8.0
3   bar  0  1.0
4   bar  0  2.0
5   bar  1  9.0
6   tmp  0  7.0

and I want to keep all rows for each group (A is the grouping variable) until B == 1 (including this row). So, my desired output is

    A    B  C
0   foo  0  2.0
1   foo  1  5.0
3   bar  0  1.0
4   bar  0  2.0
5   bar  1  9.0
6   tmp  0  7.0

How can I keep all rows of a grouped DataFrame that meet a certain criterion?

I found how to drop whole groups that do not meet a certain criterion (while keeping all rows of all other groups), but not how to drop specific rows within every group. The furthest I got was to compute, for each group, the position of the last row I want to keep:

df.groupby('A').apply(lambda x: x['B'].cumsum().searchsorted(1))

resulting in

A
bar    2
foo    1
tmp    1

This isn't sufficient, as it does not return the actual data (and it might be better if the result for tmp were 0).

Qaswed

1 Answer

After reading this question about the difference between groupby.apply and groupby.aggregate, I realized that apply passes the rows and columns of each group to the function as a DataFrame. So this is the function that should be applied to every group:

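def f(group):
    # Position of the first row where the running sum of B reaches 1
    # (equals len(group) if B is never 1, so the whole group is kept).
    index = int(group['B'].cumsum().searchsorted(1))
    return group.iloc[:index + 1]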
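
To see why the slice in f ends at the right row, here is a small trace of the cumsum/searchsorted step on the B values of two groups from the question. It is only an illustrative sketch; the variable names are made up for this example.

```python
import pandas as pd

# B values of the 'foo' group from the question
b_foo = pd.Series([0, 1, 0])
running_foo = b_foo.cumsum()                 # running count of 1s seen so far
pos_foo = int(running_foo.searchsorted(1))   # first position where the count reaches 1
kept_foo = b_foo.iloc[:pos_foo + 1]          # rows up to and including that position

# B values of the 'tmp' group, where B is never 1
b_tmp = pd.Series([0])
pos_tmp = int(b_tmp.cumsum().searchsorted(1))  # no match, so this equals len(b_tmp)
kept_tmp = b_tmp.iloc[:pos_tmp + 1]            # slicing past the end keeps the whole group
```

Note that this relies on B containing only 0 and 1: searchsorted expects sorted input, and the running sum is only guaranteed to be non-decreasing when B has no negative values.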

Running df.groupby('A').apply(f) returns the desired rows (sorted by group, with A as an additional index level):

            A       B   C
A               
bar     3   bar     0   1.0
        4   bar     0   2.0
        5   bar     1   9.0
foo     0   foo     0   2.0
        1   foo     1   5.0
tmp     6   tmp     0   7.0
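
The output above is sorted by group and carries an extra index level, so it does not exactly match the layout asked for in the question. One possible alternative, sketched here and not part of the original answer, is to build a boolean mask with a grouped cumulative sum and filter the original frame, which keeps the original row order and index. The names hit, hits_before and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'tmp'],
                   'B': [0, 1, 0, 0, 0, 1, 0],
                   'C': [2.0, 5., 8., 1., 2., 9., 7.]})

# 1 where the row matches the stop value, 0 otherwise
hit = df['B'].eq(1).astype(int)

# Number of matching rows strictly before the current row, within each group
hits_before = hit.groupby(df['A']).cumsum() - hit

# Keep rows that have no earlier match in their group
# (this includes the first matching row itself)
result = df[hits_before == 0]
print(result)
```

Passing sort=False and group_keys=False to groupby before calling apply(f) should be another way to keep the original order and a flat index.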
Qaswed