So, I have a file that gets generated at runtime. A sample of the file looks like this:
ID,Class_id,Column_A,Column_B,Column_C,Column_D,Mask
1,987,vermont,CA,450,liase,2
2,456,WB,cloo,452,var,1
3,987,CA,Cp,1000000,liase,2
4,456,SA,Cap,98376,clop,1
5,765,IN,clas,543,king,2
6,987,SA,CLA,200,loop,2
7,456,BEG,loop,876,var,1
As you can see, there are duplicate values in Class_id. The Mask column specifies the maximum number of rows with the same Class_id that may be present in the file. What I'm trying to do is remove the last occurrence of each duplicate, one by one, until the number of records for that Class_id matches its Mask value.
In the case of the file above, Class_id 987 occurs 3 times, but its Mask value is 2, so it may occur at most twice. I therefore need to remove the last occurrence of 987, which is the 6th record. The order of the records in the file is irrelevant here.
The output I'm trying to get at is like this:
ID,Class_id,Column_A,Column_B,Column_C,Column_D,Mask
1,987,vermont,CA,450,liase,2
3,987,CA,Cp,1000000,liase,2
2,456,WB,cloo,452,var,1
5,765,IN,clas,543,king,2
I scoured this site and yet couldn't find a viable solution. These are the questions I referenced:
Pandas: remove reverse duplicates from dataframe
Find Duplicates limited to multiple ranges - pandas
python pandas remove duplicate columns
How to conditionally remove duplicates from a pandas dataframe
Drop all duplicate rows in Python Pandas
I noticed that pandas has a drop_duplicates function, but how can I limit the number of duplicates to remove?
Could somebody help out a newbie here, please? Thanks.