4

How can I remove duplicate rows, but keep ALL rows with the max value. For example, I have a dataframe with 4 rows:

data = [{'a': 1, 'b': 2, 'c': 3},{'a': 7, 'b': 10, 'c': 2}, {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)

From this dataframe, I want to have a dataframe like (3 rows, group by 'a', keep all rows that have max value in 'c'):

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)
jpp
  • 159,742
  • 34
  • 281
  • 339
Tuan Anh
  • 43
  • 5

2 Answers2

3

You can use GroupBy + transform with Boolean indexing:

res = df[df['c'] == df.groupby('a')['c'].transform('max')]

print(res)

   a  b   c
0  1  2   3
1  7  2  20
2  7  2  20
jpp
  • 159,742
  • 34
  • 281
  • 339
2

You can calculate the max c per group using groupby and transform and then filter where your record is equal to the max like:

df['max_c'] = df.groupby('a')['c'].transform('max')
df[df['c']==df['max_c']].drop(['max_c'], axis=1)
Franco Piccolo
  • 6,845
  • 8
  • 34
  • 52