2

So, I have a file that gets generated at runtime. A sample of the file looks like this:

ID,Class_id,Column_A,Column_B,Column_C,Column_D,Mask
1,987,vermont,CA,450,liase,2
2,456,WB,cloo,452,var,1
3,987,CA,Cp,1000000,liase,2
4,456,SA,Cap,98376,clop,1
5,765,IN,clas,543,king,2
6,987,SA,CLA,200,loop,2
7,456,BEG,loop,876,var,1
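For anyone wanting to reproduce this, the sample can be loaded into pandas like so (the `StringIO` wrapper is just a stand-in for the real runtime file path):

```python
from io import StringIO

import pandas as pd

# Stand-in for the runtime-generated file; in practice pass the
# actual file path to pd.read_csv instead of a StringIO buffer.
sample = """\
ID,Class_id,Column_A,Column_B,Column_C,Column_D,Mask
1,987,vermont,CA,450,liase,2
2,456,WB,cloo,452,var,1
3,987,CA,Cp,1000000,liase,2
4,456,SA,Cap,98376,clop,1
5,765,IN,clas,543,king,2
6,987,SA,CLA,200,loop,2
7,456,BEG,loop,876,var,1
"""
df = pd.read_csv(StringIO(sample))
print(df['Class_id'].value_counts())
```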

As you can see, there are duplicate Class_id values. The Mask column specifies the maximum number of records with the same Class_id that may be present in the file.

What I'm trying to do is remove the last occurrence of each duplicate, one by one, until the number of records for that Class_id equals its Mask value.

In case of above file,

Class_id 987 occurs 3 times, but its Mask value is 2, so it may occur at most twice. I therefore need to remove the last occurrence of 987, which is the 6th record. The order of the records in the file is irrelevant here.

The output I'm trying to get at is like this:

ID,Class_id,Column_A,Column_B,Column_C,Column_D,Mask
1,987,vermont,CA,450,liase,2
3,987,CA,Cp,1000000,liase,2
2,456,WB,cloo,452,var,1
5,765,IN,clas,543,king,2

I scoured this site and still couldn't find a viable solution. These are the questions I referenced:

Pandas: remove reverse duplicates from dataframe
Find Duplicates limited to multiple ranges - pandas
python pandas remove duplicate columns
How to conditionally remove duplicates from a pandas dataframe
Drop all duplicate rows in Python Pandas

I noticed that pandas has a drop_duplicates function. But how can I limit the number of duplicates to remove?
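For context, `drop_duplicates` keeps only a single row per key and has no parameter to cap duplicates at N, which is why it doesn't fit here on its own. A quick sketch (this toy frame only mirrors the relevant columns of the sample above):

```python
import pandas as pd

# drop_duplicates keeps ONE row per Class_id, regardless of Mask --
# there is no built-in way to keep "up to N" duplicates.
df = pd.DataFrame({'Class_id': [987, 456, 987, 987],
                   'Mask': [2, 1, 2, 2]})
deduped = df.drop_duplicates(subset='Class_id')
print(deduped)  # one row each for 987 and 456
```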

Could somebody help out a newbie here, please? Thanks.


3 Answers


Use cumcount to trim off the excess rows, and pd.factorize + np.bincount to filter out rows whose group has fewer rows than Mask:

import numpy as np
import pandas as pd

mask = df.Mask.values
f, u = pd.factorize(df.Class_id.values)

cond1 = df.groupby('Class_id').cumcount().lt(mask)
cond2 = np.bincount(f)[f] >= mask

df[cond1 & cond2]

   ID  Class_id Column_A Column_B  Column_C Column_D  Mask
0   1       987  vermont       CA       450    liase     2
1   2       456       WB     cloo       452      var     1
2   3       987       CA       Cp   1000000    liase     2

You'll notice that this is NOT what OP has as their desired output. That output is not consistent with number of rows being EQUAL to the value in the Mask column.


Another way, using transform('size') instead of pd.factorize, to eliminate rows whose group count is too small:

g = df.groupby('Class_id')

mask = df.Mask.values
cond1 = g.cumcount().lt(mask)
cond2 = g.ID.transform('size').ge(mask)

df[cond1 & cond2]

   ID  Class_id Column_A Column_B  Column_C Column_D  Mask
0   1       987  vermont       CA       450    liase     2
1   2       456       WB     cloo       452      var     1
2   3       987       CA       Cp   1000000    liase     2
  • Just commented. – piRSquared Feb 19 '18 at 06:49
  • I am confused, something missing? – jezrael Feb 19 '18 at 06:50
  • 1
    Based on OP's words "until the number of duplicate records is same as the its Mask value". Their intention is ambiguous. I'm making an assumption that they meant for the number of elements of `'Class_id'` to be equal to the value in the `Mask` column. It may be that what you have is exactly what they wanted, and is what I did in the first place. But I noticed this and want to address it. – piRSquared Feb 19 '18 at 06:53

Use boolean indexing with a mask that compares the cumcount Series against the Mask column using lt (<):

df = df[df.groupby('Class_id').cumcount().lt(df.Mask)]
print (df)

   ID  Class_id Column_A Column_B  Column_C Column_D  Mask
0   1       987  vermont       CA       450    liase     2
1   2       456       WB     cloo       452      var     1
2   3       987       CA       Cp   1000000    liase     2
4   5       765       IN     clas       543     king     2

Detail:

print (df.groupby('Class_id').cumcount())
0    0
1    0
2    1
3    1
4    0
5    2
6    2
dtype: int64

Use cumcount

In [260]: df[df.groupby('Class_id').cumcount() < df['Mask']]
Out[260]:
   ID  Class_id Column_A Column_B  Column_C Column_D  Mask
0   1       987  vermont       CA       450    liase     2
1   2       456       WB     cloo       452      var     1
2   3       987       CA       Cp   1000000    liase     2
4   5       765       IN     clas       543     king     2
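Since the file is generated at runtime, an end-to-end sketch might look like this (the temp-file paths here are placeholders standing in for the real input and output files):

```python
import os
import tempfile

import pandas as pd

# Placeholder paths -- substitute the actual runtime file locations.
src = os.path.join(tempfile.gettempdir(), 'input.csv')
dst = os.path.join(tempfile.gettempdir(), 'output.csv')

# Create a tiny input file inline so the example runs standalone.
pd.DataFrame({'ID': [1, 2, 3, 4],
              'Class_id': [987, 987, 987, 456],
              'Mask': [2, 2, 2, 1]}).to_csv(src, index=False)

df = pd.read_csv(src)
# cumcount numbers each row within its Class_id group 0, 1, 2, ...;
# keeping rows where that number is below Mask caps each group at Mask rows.
keep = df.groupby('Class_id').cumcount() < df['Mask']
df[keep].to_csv(dst, index=False)
```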