Remove rows in a group by until the last row meets some condition

Question

I have the following df

We can assume that this data is already sorted. What i need to do is, for every id, I need to remove rows under the following conditions

the first entry for every id is type A
the last entry for every id is type B
the last entry's B is the last one that appears (data is already sorted)

I've accomplished 1. with the following:

df = df.groupby('id').filter(lambda x: x['Type'].iloc[0] != 'A')

Which removes ids entirely if their first type isn't A

However, for 2. and 3., I don't want to remove the id if the last type isn't B, instead I just want to remove everything in the middle

Resulting df:

example code:

d = {'id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3},
 'type': {0: 'A',
  1: 'B',
  2: 'A',
  3: 'A',
  4: 'B',
  5: 'A',
  6: 'B',
  7: 'A',
  8: 'B',
  9: 'A',
  10: 'A'}}

df = pd.DataFrame.from_dict(d)

what if the first `A` appears before the last `B` for some id? — , Apr 21 '22 at 17:08
every id should start with an `A`, the rest don't matter whether they're before or after the next `B` within an id essentially the first `A` and the last `B` regardless of what is inbetween for each each id — S44, Apr 21 '22 at 17:13

score 1 · Accepted Answer · answered Apr 21 '22 at 17:14

1

It seems you could use drop_duplicates with different rule depending on type:

out = pd.concat([df.query("type=='A'").drop_duplicates(subset=['id','type'], keep='first'), 
                 df.query("type=='B'").drop_duplicates(subset=['id','type'], keep='last')]).sort_index()

Output:

answered Apr 21 '22 at 17:14

Thanks, i will try this now. is there any advantage to using df.query over other filtering methods such as df[df[] ==''] – S44 Apr 21 '22 at 17:21
1

@Sebazz44 I just used it here for convenience in chaining but there's a [slight performance difference](https://stackoverflow.com/a/71626557/17521785) for large frames. – Apr 21 '22 at 17:35

mozway · Answer 2 · 2022-04-21T17:27:12.187

1

You could simply use masks to slice the DataFrame:

m1 = df['type'].eq('B')
# first non-duplicate
m2 = ~df.duplicated(keep='first')
# last non-duplicate
m3 = ~df.duplicated(keep='last')

df[(m1&m2).shift(-1)|(m1&m3)]

# (m1&m2).shift(-1) -> value before the first B (i.e an A)
# (m1&m3)           -> last B

output:

edited Apr 21 '22 at 17:27

answered Apr 21 '22 at 17:22

mozway

194,879
13
39
75

Ill try this as well, btw whats 'IIUC' – S44 Apr 21 '22 at 17:24
1

"If I Understand Correctly" ;) – mozway Apr 21 '22 at 17:25

Remove rows in a group by until the last row meets some condition

2 Answers2