
I am looking for a function that works exactly like pandas.DataFrame.drop_duplicates() but lets me keep not only the first occurrence but the first x occurrences (say, 10). Does anything like that exist? Thanks for your help!

Fulvio

1 Answer


IIUC, one way to do this is with groupby and head, which selects the first x occurrences within each group. As noted in the docs, head:

Returns first n rows of each group.

Sample code:

x = 10
df.groupby('col').head(x)

Here col is the column you want to check for duplicates, and x is the number of occurrences you want to keep for each value in col.

For instance:

In [81]: df.head()
Out[81]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
....

# keep the first 3 instances of each value in column a:

x = 3
df.groupby('a').head(x)

Out[82]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
5  1  1.476599
6  1  0.898684
8  2 -0.824963
9  2 -0.290499
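
To make this feel closer to drop_duplicates, the same idea can be wrapped in a small helper. This is only a sketch: keep_n and its subset/keep parameters are hypothetical names chosen to mirror drop_duplicates, not part of pandas, and the sample frame is made up. It also assumes pandas 1.1 or later, since it passes dropna=False to groupby so that rows with missing keys are grouped rather than excluded.

```python
import pandas as pd


def keep_n(df, subset, n, keep="first"):
    """Keep at most n rows per distinct value (or combination) in subset.

    Hypothetical helper: keep="first" uses head, anything else uses tail.
    """
    # sort=False avoids sorting the group keys; dropna=False keeps NaN keys as a group
    grouped = df.groupby(subset, sort=False, dropna=False)
    return grouped.head(n) if keep == "first" else grouped.tail(n)


# Made-up data: each value in column a appears three or four times
df = pd.DataFrame({"a": [3, 3, 3, 1, 2, 1, 1, 1, 2, 2, 2], "b": range(11)})

first_three = keep_n(df, "a", 3)
print(first_three.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 8, 9]

# With n=1 the result matches drop_duplicates on the same column
print(keep_n(df, "a", 1).equals(df.drop_duplicates(subset="a")))  # True
```

Both head and tail return the rows in their original order with the original index, so the result lines up with what drop_duplicates would give. Passing a list of column names as subset checks for duplicates across several columns at once.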
sacuL