What I am looking for is a function that works exactly like pandas.DataFrame.drop_duplicates() but that lets me keep not only the first occurrence but the first 'x' occurrences (say, 10). Does anything like that exist? Thanks for your help!
1 Answer
IIUC, one way to do this would be with groupby and head, to select the first x occurrences. As noted in the docs, head:
Returns first n rows of each group.
Sample code:
x = 10
df.groupby('col').head(x)
Where col is the column you want to check for duplicates, and x is the number of occurrences you want to keep for each value in col.
For instance:
In [81]: df.head()
Out[81]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
....
# keep 3 first instances of each value in column a:
x = 3
df.groupby('a').head(x)
Out[82]:
   a         b
0  3  0.912355
1  3  2.091888
2  3 -0.422637
3  1 -0.293578
4  2 -0.817454
5  1  1.476599
6  1  0.898684
8  2 -0.824963
9  2 -0.290499
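If you want something that reads more like drop_duplicates, the same idea can be wrapped in a small helper. This is only a sketch: keep_first_n is a hypothetical name (not a pandas function), and the frame below is made-up data whose column a mirrors the pattern of the example above. It also assumes the key column has no missing values.

```python
import pandas as pd


def keep_first_n(df, subset, n):
    """Keep at most the first n rows for each distinct value in `subset`.

    `subset` may be a column name or a list of column names.
    Hypothetical helper; it only wraps groupby(...).head(n).
    """
    return df.groupby(subset, sort=False).head(n)


# Made-up data: value 1 appears four times in column "a" (rows 3, 5, 6, 7).
df = pd.DataFrame({
    "a": [3, 3, 3, 1, 2, 1, 1, 1, 2, 2],
    "b": range(10),
})

# Keep the first 3 rows per value of "a"; row 7 (the 4th "1") is dropped.
first_three = keep_first_n(df, "a", 3)
print(first_three.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 8, 9]

# With n=1 this matches drop_duplicates on the same column.
first_one = keep_first_n(df, "a", 1)
same_as_drop = first_one.equals(df.drop_duplicates(subset="a"))

# An equivalent formulation: number the rows within each group and filter.
via_cumcount = df[df.groupby("a").cumcount() < 3]

# tail(n) is the counterpart for keeping the last n rows per group.
last_one = df.groupby("a").tail(1)
```

Note that head keeps the rows in their original order and with their original index, so the result lines up with what drop_duplicates would return for n=1.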

sacuL
Yes, that's exactly what I was looking for. It perfectly solves the problem. Thanks! – Fulvio Feb 19 '19 at 02:40