1

Say I have a dataframe like

a  b  c
1  2  3
1  2  3
.
.

and I want to allow at most, say, 100 duplicated (a, b) pairs. For example, if there are 200 rows with a=1 and b=2, I want to keep only 100 of them.

I cannot use duplicated on a GroupBy object, so I'm rather lost on how to solve this.

CutePoison

4 Answers

2
# n: number of duplicates to keep
df.groupby(['a', 'b'], as_index=False).head(n)
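
As a minimal sketch of how this behaves on a small frame (the helper name `keep_first_n` and the sample data are made up for illustration, not part of the answer): `head(n)` on a GroupBy returns the first n rows of each group, in their original order and with their original index labels.

```python
import pandas as pd

def keep_first_n(df, cols, n):
    """Keep at most n rows per unique combination of cols (hypothetical helper)."""
    return df.groupby(cols).head(n)

# Illustrative data: four rows share the pair (a=1, b=2), one row has (a=2, b=3)
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 1],
    "b": [2, 2, 2, 3, 2],
    "c": [10, 11, 12, 13, 14],
})

capped = keep_first_n(df, ["a", "b"], n=2)
print(capped)  # rows with index 0, 1 and 3 remain
```

Rows 2 and 4 are dropped because they are the third and fourth occurrences of the pair (1, 2).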
Marat
1

I believe you can do it this way:

import pandas as pd

max_duplicates = 100  # total rows to keep per (a, b) pair, as in the question
group_cols = ['a', 'b']

# True for every row except the first occurrence of each (a, b) pair
duplicates = df.duplicated(subset=group_cols, keep='first')

# group the repeated rows by pair
groups = df[duplicates].groupby(group_cols)

# first occurrences plus at most max_duplicates - 1 repeats per pair,
# sorted back into the original row order
df_clean = pd.concat([df[~duplicates], groups.head(max_duplicates - 1)]).sort_index()
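
To make the role of the mask clearer, here is a small sketch (sample data invented for illustration) of what `df.duplicated(..., keep='first')` returns: the first occurrence of each pair is marked False and every later repeat is marked True, so the first occurrences are counted separately from the repeats.

```python
import pandas as pd

# Illustrative data: three rows with (a=1, b=2), one row with (a=2, b=3)
df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [2, 2, 2, 3], "c": [3, 3, 3, 4]})

# False for the first occurrence of each (a, b) pair, True for later repeats
mask = df.duplicated(subset=["a", "b"], keep="first")
print(mask.tolist())

first_rows = df[~mask]  # one row per distinct pair
repeats = df[mask]      # everything after the first occurrence
print(len(first_rows), len(repeats))
```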
Matmozaur
1

One option is to group by a and b, compute a cumcount, and then filter. Example:

df
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
3  1  2  4
4  2  2  1
5  2  2  2

To keep the first 3 rows of each group:

df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
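
The same idea as a self-contained sketch with the limit pulled out into a variable (the names `n`, `position` and `kept` are chosen here for illustration): `cumcount()` numbers the rows within each group starting at 0, so comparing with `< n` keeps the first n rows per group.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 3  # rows to keep per (a, b) pair

# 0-based position of each row within its (a, b) group
position = df.groupby(["a", "b"]).cumcount()
print(position.tolist())

# keep rows whose position within the group is below the limit
kept = df[position < n]
print(kept)
```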
Psidom
0

You can achieve this by combining the groupby and head methods in pandas. Here's a solution that keeps only the first 100 rows for each pair of 'a' and 'b':

import pandas as pd

# Your example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Set the number of duplicates you want to keep
num_dups_to_keep = 100

# Group the DataFrame by columns 'a' and 'b', and keep only the first 'num_dups_to_keep' rows for each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)

# Reset the index
result = result.reset_index(drop=True)

print(result)

This snippet groups the DataFrame by the 'a' and 'b' columns and keeps only the first 100 rows of each group. If a pair has fewer than 100 rows, all of them are kept.
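
Since the five-row example never reaches the limit, here is a sketch that reproduces the scenario from the question (200 rows with a=1 and b=2) and checks the group sizes afterwards. The data and the names `limit` and `sizes` are invented for illustration.

```python
import pandas as pd

# 200 rows with (a=1, b=2) and 5 rows with (a=2, b=3)
df = pd.DataFrame({
    "a": [1] * 200 + [2] * 5,
    "b": [2] * 200 + [3] * 5,
    "c": list(range(205)),
})

limit = 100
result = df.groupby(["a", "b"]).head(limit)

# Count the rows left per pair to confirm the cap was applied
sizes = result.groupby(["a", "b"]).size()
print(sizes)
```

The pair (1, 2) is cut down to 100 rows, while the pair (2, 3) keeps all 5 of its rows.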