How to selectively drop x number of rows for repeating entries in a dataframe using python?

Question

Imagine a dataframe, such as:

COL1             COL2                   COL3
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
bar              unique string          unique int             
bar              unique string          unique int
bar              unique string          unique int
bar              unique string          unique int
qux              unique string          unique int
qux              unique string          unique int
biz              unique string          unique int

I'd like to set a rule where I only keep a max of 3 (random/first) rows for repeating items in COL1. I don't really care which row I drop, so it can be random, so it's a conscious choice.

I'm unsure how best to solve this elegantly, without having to loop over my df. I've looked into groupby, sort & drop.duplicates but I'm not too sure this is the way to go.

My best guess atm, is to sort by COL1, add COL4 where I count items by COL1(group), and then drop all rows where COL4 exceeds 3. But I'm not sure how to count items in a 4th col... Is this the way to go? Any suggestions?

Expected Output:

COL1             COL2                   COL3
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
bar              unique string          unique int             
bar              unique string          unique int
bar              unique string          unique int
qux              unique string          unique int
qux              unique string          unique int
biz              unique string          unique int

score 1 · Accepted Answer · answered Dec 24 '20 at 04:45

1

You can use GroupBy.head here to take 1st three value from a group or GroupBy.tail to take last three.

df.groupby('COL1').head(3)

   COL1           COL2        COL3
0   foo  unique string  unique int
1   foo  unique string  unique int
2   foo  unique string  unique int
5   bar  unique string  unique int
6   bar  unique string  unique int
7   bar  unique string  unique int
9   qux  unique string  unique int
10  qux  unique string  unique int
11  biz  unique string  unique int

answered Dec 24 '20 at 04:45

Ch3steR

20,090
4
28
58

Shoot. This is seriously stupid easy. Thank youuu! :D. Accepting this answer in a few minutes. Also, here's a related thread/solution: https://stackoverflow.com/questions/37997668/pandas-number-rows-within-group-in-increasing-order – LucaVJ Dec 24 '20 at 04:49

How to selectively drop x number of rows for repeating entries in a dataframe using python?

1 Answers1