Remove duplicate rows with one different value

Question

I have a dataframe with duplicate rows except for one value. I want to filter them out and only keep the row with the higher value.

User_ID - Skill - Year_used
1 - skill_a - 2017
1 - skill_b - 2015
1 - skill_a - 2018
2 - skill_c - 2011

etc.

So for example rows with skill_a and the same User_ID need to be compared and only the one with the latest year should be kept.

transform('count')

Only gives me the number of rows in each User_ID group.

value_counts()

Only gives me a series I can't merge back to the df.

Any ideas?

Thank you

`df.drop_duplicates(subset=['User_ID','Skill'],keep='last')` - does this work? - Umar.H, Feb 06 '19 at 12:05

yatu · Answer 1 · 2019-02-06T12:09:30.993

1

One option is to group by User_ID and Skill and keep the max Year_used for each pair:

df.groupby(['User_ID','Skill']).Year_used.max().reset_index()

     User_ID    Skill  Year_used
0        1  skill_a       2018
1        1  skill_b       2015
2        2  skill_c       2011
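For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is rebuilt from the four rows shown in the question, and the names `latest`, `group_max` and `latest_rows` are only illustrative. The second variant uses `transform('max')`, which the question was close to with `transform('count')`; it keeps whole rows, which may be preferable if the real frame has more columns than these three. Note that it keeps every tied row if two rows share the same maximum year.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question
df = pd.DataFrame({
    "User_ID": [1, 1, 1, 2],
    "Skill": ["skill_a", "skill_b", "skill_a", "skill_c"],
    "Year_used": [2017, 2015, 2018, 2011],
})

# One row per (User_ID, Skill) pair, holding the latest year
latest = df.groupby(["User_ID", "Skill"], as_index=False)["Year_used"].max()

# Variant: broadcast each group's max back onto the original rows,
# then keep only the rows whose year equals that max
group_max = df.groupby(["User_ID", "Skill"])["Year_used"].transform("max")
latest_rows = df[df["Year_used"] == group_max]

print(latest)
print(latest_rows)
```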

edited Feb 06 '19 at 12:09

answered Feb 06 '19 at 12:04

yatu


score 1 · Accepted Answer · answered Feb 06 '19 at 12:07

1

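You can sort by Year_used first and then use drop_duplicates with keep='last', so the row with the highest year is the one kept for each User_ID and Skill pair: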

df = df.sort_values('Year_used').drop_duplicates(['User_ID','Skill'], keep='last')
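A small sketch of why the sort matters, assuming the question's rows arrive in a different order (here the 2018 row is deliberately placed before the 2017 one; the names `unsorted` and `result` are only illustrative). Without sorting, `keep='last'` keeps whichever duplicate happens to come last in the frame, not necessarily the newest one.

```python
import pandas as pd

# Rows from the question, reordered so the newest skill_a row comes first
df = pd.DataFrame({
    "User_ID": [1, 1, 1, 2],
    "Skill": ["skill_a", "skill_b", "skill_a", "skill_c"],
    "Year_used": [2018, 2015, 2017, 2011],
})

# No sort: the last skill_a row in the frame wins, which is the 2017 one here
unsorted = df.drop_duplicates(["User_ID", "Skill"], keep="last")

# Sort first: the last duplicate is now the one with the highest year
result = (
    df.sort_values("Year_used")
      .drop_duplicates(["User_ID", "Skill"], keep="last")
      .sort_index()  # optional: restore the original row order
)

print(unsorted)
print(result)
```

The trailing `sort_index()` is optional and only puts the surviving rows back in their original order.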

answered Feb 06 '19 at 12:07

Sociopath

