
I have a dataframe defined as:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [7, 2, 2, 5, 7, 2, 2]})

I would like to drop duplicated rows based on columns A and C. However, I only want this to work partially.

If I use

df.drop_duplicates(subset=['A','C'], keep='first')

It will drop rows 2, 5, and 6. However, I only want to drop rows 2 and 6. The desired result is:

df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
                 'B':[0, 2, 4, 5, 6],
                 'C':[7, 2, 5, 7, 2]})
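
For reference, this is what that call actually returns on the frame above (every later occurrence of an (A, C) pair is removed, including row 5, which I want to keep):

df.drop_duplicates(subset=['A', 'C'], keep='first')

   A  B  C
0  1  0  7
1  3  2  2
3  4  4  5
4  5  5  7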
Pei Li
    What do you mean by *discretely*? – yatu Apr 09 '20 at 16:32
  • @yatu I mean partially drop the duplicated values. – Pei Li Apr 09 '20 at 16:33
  • @yatu Since df.drop_duplicates(subset=['A','C'], keep='first') will drop ALL the duplicated rows (2, 5, 6) and keep only the first (1), but I ONLY want to drop rows 2 and 6 — that's what I mean by partially. – Pei Li Apr 09 '20 at 16:38

2 Answers


Here's how you can do this, using shift:

df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)

Output:

   A  B  C
0  1  0  7
1  3  2  2
2  4  4  5
3  5  5  7
4  3  6  2

This question is a nice reference.
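
In case it helps, here is the intermediate mask: shift moves everything down by one row, so each row's A and C values are compared with those of the row directly above it, and a row is only dropped when it repeats the immediately preceding (A, C) pair.

mask = (df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)
print(mask.tolist())
# [True, True, False, True, True, True, False]

Rows 2 and 6 are the only ones flagged False because they duplicate the row right before them; row 5 repeats (3, 2) but not consecutively, so it survives.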

cosmic_inquiry

You can just keep every second repetition of each (A, C) pair:

df = df.loc[df.groupby(["A", "C"]).cumcount() % 2 == 0]

Output:

   A  B  C
0  1  0  7
1  3  2  2
3  4  4  5
4  5  5  7
5  3  6  2
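
To see the grouping logic, you can print the counter itself: cumcount numbers every occurrence of an (A, C) pair across the whole frame starting from zero (not only consecutive repeats), and on this data keeping the even counts happens to remove exactly rows 2 and 6.

print(df.groupby(["A", "C"]).cumcount().tolist())
# [0, 0, 1, 0, 0, 2, 3]

Rows 2 and 6 carry the odd counts 1 and 3, so the % 2 filter drops them and keeps everything else.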
Grzegorz Skibinski