2

I have a dataframe with duplicate rows except for one value. I want to filter them out and only keep the row with the higher value.

User_ID - Skill - Year_used
1 - skill_a - 2017
1 - skill_b - 2015
1 - skill_a - 2018
2 - skill_c - 2011

etc.

So for example rows with skill_a and the same User_ID need to be compared and only the one with the latest year should be kept.

transform.('count')

Only gives me the amount of rows of the group by User_ID.

value_counts()

Only gives me a series I can't merge back to the df.

Nay ideas?

Thank you

Sam_Ste
  • 334
  • 2
  • 16

2 Answers2

1

One option is to groupby the Skill and keep the max Year_used:

df.groupby(['User_ID','Skill']).Year_used.max().reset_index()

     User_ID    Skill  Year_used
0        1  skill_a       2018
1        1  skill_b       2015
2        2  skill_c       2011
yatu
  • 86,083
  • 12
  • 84
  • 139
1

You can use drop_duplicates by sorting a column to keep max

df = df.sort_values('Year_used').drop_duplicates(['User_ID','Skill'], keep='last')
Sociopath
  • 13,068
  • 19
  • 47
  • 75