Remove specific set of rows from each group in a dataframe

Question

I have a dataframe as follows :

import pandas as pd

df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
                   "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10]})

Notice that the dataframe is sorted by user_id, then by value in descending order within each user_id.

For each user_id, I would like to remove the 2nd and 4th row so the output would look like

df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'b', 'b', 'b',],
                   "value": [20, 15, 8 , 18, 17, 10]})

Inspired by drop first and last row from within each group, I tried the following :

def drop_rows(dataframe) : 
     pos = [1,3]
     return dataframe.drop(dataframe.index[pos], inplace=True)
df.groupby('user_id').apply(drop_rows)

But I got this error: "index 2 is out of bounds for axis 0 with size 0"

Could someone explain why this doesn't work and how I should proceed instead? Also, given that the dataset is quite large, an efficient approach would be helpful. Thanks a lot.

score 4 · Accepted Answer · answered Jul 21 '20 at 15:49

You can use groupby + cumcount to number the rows within each group (adding 1 makes the numbering 1-based), then keep only the rows whose number is not in the to_del list:

to_del = [2,4]
df[~df.groupby('user_id').cumcount().add(1).isin(to_del)]

  user_id  value
0       a     20
2       a     15
4       a      8
5       b     18
7       b     17
9       b     10
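
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same cumcount approach. The helper name drop_positions_per_group and its parameters are hypothetical (they are not part of pandas or of the original answer); the sample data is taken from the question.

```python
import pandas as pd


def drop_positions_per_group(frame, key, positions):
    """Drop rows at the given 1-based positions within each group.

    Hypothetical helper for illustration; not a pandas function.
    """
    # cumcount() numbers rows 0, 1, 2, ... inside each group;
    # add(1) shifts that to 1-based numbering to match `positions`.
    row_number = frame.groupby(key).cumcount().add(1)
    # Keep every row whose within-group number is NOT listed.
    return frame[~row_number.isin(positions)]


df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
                   "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10]})

result = drop_positions_per_group(df, "user_id", [2, 4])
print(result)
```

This builds a single boolean mask over the whole frame instead of calling a Python function once per group, which is likely why it scales better than groupby + apply on large data. It also does not fail when a group has fewer rows than the largest position, since isin simply finds no match there. As for the attempt in the question, one thing that is certain is that drop(..., inplace=True) returns None, so drop_rows hands nothing back to apply; that may not be the whole story behind the exact error message.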


anky


Awesome. I was wondering whether there was another way to do this, and I can't think of one without using factorize and apply; surrogate keys are the way to go. – Umar.H Jul 21 '20 at 15:56

@Manakin I'd be interested to know too :) – anky Jul 21 '20 at 15:57
