Remove observations beyond the i-th duplicate in pandas

Question

Say I have a dataframe with columns a and b, and I want to allow at most, say, 100 rows for each duplicated (a, b) pair. That is, if there are 200 rows with a=1 and b=2, I want to keep 100 of them.

I cannot use `duplicated` on a GroupBy object, so I'm rather lost on how to solve this.

Could the *200 pairs of a=1 and b=2* be scattered throughout the columns, or do they appear consecutively? – RomanPerekhrest, May 06 '23 at 18:32

score 2 · Accepted Answer · answered May 06 '23 at 18:42

# n: number of duplicates to keep
df.groupby(['a', 'b'], as_index=False).head(n)
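
To see the one-liner above in action, here is a minimal sketch on a small made-up frame. The data and the value n = 2 are assumptions for illustration only; the question's real frame is not shown. As far as I know, `head` on a GroupBy acts as a filter that returns rows of the original frame with their original index and order, so `as_index` has no effect on the result and is left out here.

```python
import pandas as pd

# Hypothetical data: the pair (1, 2) appears four times, the pair (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 2  # maximum number of rows to keep per (a, b) pair

# GroupBy.head(n) keeps the first n rows of every group and
# preserves the original index and row order.
capped = df.groupby(["a", "b"]).head(n)
print(capped)
```

With these assumed values, rows 2 and 3 (the third and fourth occurrence of the pair (1, 2)) are dropped, while the pair (2, 2) is kept in full because it has only two rows.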

Marat


score 1 · Answer 2 · answered May 06 '23 at 18:37

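I believe you can do it this way:

import pandas as pd

max_duplicates = 100  # maximum number of rows to keep per (a, b) pair
group_cols = ['a', 'b']

# True for every row whose (a, b) pair has already appeared earlier
duplicates = df.duplicated(subset=group_cols, keep='first')

# group only the repeated rows
groups = df[duplicates].groupby(group_cols)

# first occurrence of each pair plus up to max_duplicates - 1 repeats,
# sorted by index (the original row order for a default RangeIndex)
df_clean = pd.concat([df[~duplicates], groups.head(max_duplicates - 1)]).sort_index()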

score 1 · Answer 3 · answered May 06 '23 at 18:38

One option is to group by a and b, compute a cumcount, and then filter. For example, to keep the first 3 rows of each group:

df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
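
The input frame for the output above is not shown, so the sketch below uses an assumed frame chosen to be consistent with it. It also writes the condition as `cumcount() < n`, which is equivalent to `<= n - 1` and makes the number of rows kept explicit. The names `n`, `position` and `kept` are illustrative.

```python
import pandas as pd

# Assumed input, reconstructed to match the output shown above:
# four rows for the pair (1, 2) and two rows for the pair (2, 2).
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 3  # number of rows to keep per (a, b) pair

# cumcount() numbers the rows inside each group starting from 0,
# so the first n rows of a group are those with a count below n.
position = df.groupby(["a", "b"]).cumcount()
kept = df[position < n]
print(kept)
```

Because the mask is an ordinary boolean Series, it can be combined with other conditions before indexing, which is the main practical difference from calling `head(n)` on the groups.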

Psidom


Small correction, since we count from 0, it has to be `<2` if we want the top 2. – CutePoison May 08 '23 at 05:43

score 0 · Answer 4 · answered May 06 '23 at 19:00

You can achieve this by using the groupby method in conjunction with the head method in pandas. Here's a solution to keep only the first 100 duplicates for each pair of 'a' and 'b':

import pandas as pd

# Your example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Set the number of duplicates you want to keep
num_dups_to_keep = 100

# Group the DataFrame by columns 'a' and 'b', and keep only the first 'num_dups_to_keep' rows for each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)

# Reset the index
result = result.reset_index(drop=True)

print(result)

This code snippet groups the DataFrame by the 'a' and 'b' columns and keeps only the first 100 rows of each group. If a specific pair has fewer than 100 rows, all of them are kept.
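
With a cap of 100, the five-row sample frame passes through unchanged, so the effect is hard to see. As a rough check, the sketch below reuses the same sample data with a hypothetical cap of 2 and counts the rows left per pair. The names `cap` and `sizes` are illustrative.

```python
import pandas as pd

# Same sample data as in the answer above.
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

cap = 2  # hypothetical small cap so that a row is actually dropped

result = df.groupby(['a', 'b']).head(cap)

# Count the rows remaining for each (a, b) pair; none should exceed the cap.
sizes = result.groupby(['a', 'b']).size()
print(sizes)
```

Here the pair (1, 2) starts with three rows and is trimmed to two, while the pair (2, 3) has exactly two rows and is left as it is.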
