1

I have a dataframe with two columns ID and Salary

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1200.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+      
| 2  | 1500.0 |  
+----+--------+  

In this data frame some ID are duplicated, ID=2 in this case. I want to keep the highest salary for each duplicate ID.

+----+--------+  
| ID | salary |  
+----+--------+  
| 1  | 1000.0 |  
+----+--------+  
| 2  | 1500.0 |  
+----+--------+  
| 3  | 1000.0 |  
+----+--------+      
| 4  | 2000.0 |     
+----+--------+      
Zeugma
  • 31,231
  • 9
  • 69
  • 81
Luis Ramon Ramirez Rodriguez
  • 9,591
  • 27
  • 102
  • 181

3 Answers3

3

You can perform a groupby on 'ID' and take the maximum:

df = df.groupby('ID', as_index=False)['salary'].max()

The resulting output:

   ID  salary
0   1  1000.0
1   2  1500.0
2   3  1000.0
3   4  2000.0
root
  • 32,715
  • 6
  • 74
  • 87
3

This question is a duplicate from Python : Getting the Row which has the max value in groups using groupby

Here's a one-liner:

df.groupby('ID', sort=False)['salary'].max()

And here's another one:

df.sort_values('salary', ascending=False).drop_duplicates(['ID'])    
Community
  • 1
  • 1
rafaelvalle
  • 6,683
  • 3
  • 34
  • 36
1

a more general solution for dataframes with more than two columns in which you want all rows with the max salary per id.

data = {'ID':[1,2,3,4,2],'salary':[1e3,1.2e3,1e3,2e3,1.5e3]}
df = pd.DataFrame(data).assign(more_data=range(5))

df.loc[df.groupby('ID').salary.idxmax()]

   ID  salary  more_data
0   1  1000.0          0
4   2  1500.0          4
2   3  1000.0          2
3   4  2000.0          3
piRSquared
  • 285,575
  • 57
  • 475
  • 624