
I have a dataframe defined as:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [7, 2, 2, 5, 7, 2, 2]})

I would like to drop duplicated rows based on columns A and C. However, I only want this to work partially.

If I use

df.drop_duplicates(subset=['A','C'], keep='first')

It will drop rows 2, 5, and 6. However, I only want to drop rows 2 and 6. The desired result is:

df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
                 'B':[0, 2, 4, 5, 6],
                 'C':[7, 2, 5, 7, 2]})
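
For reference, this is what that call actually returns on the frame above (every later occurrence of an (A, C) pair is removed, including row 5, which I want to keep):

df.drop_duplicates(subset=['A', 'C'], keep='first')

   A  B  C
0  1  0  7
1  3  2  2
3  4  4  5
4  5  5  7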
Pei Li
    What do you mean by *discretely*? – yatu Apr 09 '20 at 16:32
  • @yatu I mean partially drop the duplicated values. – Pei Li Apr 09 '20 at 16:33
  • @yatu Since df.drop_duplicates(subset=['A','C'], keep='first') will drop ALL the duplicated rows (2, 5, 6) and keep only the first (1), but I ONLY want to drop rows 2 and 6 — that's what I mean by partially. – Pei Li Apr 09 '20 at 16:38

2 Answers


Here's how you can do this, using shift:

df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)

Output:

   A  B  C
0  1  0  7
1  3  2  2
2  4  4  5
3  5  5  7
4  3  6  2

This question is a nice reference.
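
In case it helps, here is the intermediate mask: shift moves everything down by one row, so each row's A and C values are compared with those of the row directly above it, and a row is only dropped when it repeats the immediately preceding (A, C) pair.

mask = (df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)
print(mask.tolist())
# [True, True, False, True, True, True, False]

Rows 2 and 6 are the only ones flagged False because they duplicate the row right before them; row 5 repeats (3, 2) but not consecutively, so it survives.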

cosmic_inquiry

You can just keep every second repetition of each (A, C) pair:

df = df.loc[df.groupby(["A", "C"]).cumcount() % 2 == 0]

Output:

   A  B  C
0  1  0  7
1  3  2  2
3  4  4  5
4  5  5  7
5  3  6  2
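
To see the grouping logic, you can print the counter itself: cumcount numbers every occurrence of an (A, C) pair across the whole frame starting from zero (not only consecutive repeats), and on this data keeping the even counts happens to remove exactly rows 2 and 6.

print(df.groupby(["A", "C"]).cumcount().tolist())
# [0, 0, 1, 0, 0, 2, 3]

Rows 2 and 6 carry the odd counts 1 and 3, so the % 2 filter drops them and keeps everything else.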
Grzegorz Skibinski