Pandas: how to remove duplicate rows, but keep ALL rows with max value

Question

How can I remove duplicate rows but keep ALL rows that have the max value? For example, I have a dataframe with 4 rows:

import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 10, 'c': 2}, {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)

From this dataframe, I want to get a dataframe like this (3 rows: group by 'a' and keep every row that has the max value of 'c' within its group):

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)

score 3 · Accepted Answer · answered Nov 02 '18 at 11:48

You can use GroupBy + transform with Boolean indexing:

res = df[df['c'] == df.groupby('a')['c'].transform('max')]

print(res)

   a  b   c
0  1  2   3
2  7  2  20
3  7  2  20
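
To see why the comparison works, it may help to look at what transform('max') returns on its own: a Series aligned to the original rows, holding each row's group maximum. The following is a minimal sketch using the question's data; the names group_max and res_reset are illustrative and not part of the answer above.

```python
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 10, 'c': 2},
        {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)

# One value per original row: the max of 'c' within that row's 'a' group
group_max = df.groupby('a')['c'].transform('max')
print(group_max.tolist())  # [3, 20, 20, 20]

# Keep rows whose 'c' equals their group's max; tied rows are all kept
res = df[df['c'] == group_max]

# Optional: renumber the index from 0 after filtering
res_reset = res.reset_index(drop=True)
print(res_reset)
```

Because the mask is computed per row, every row tied for the group maximum survives, which is what distinguishes this from drop_duplicates.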

jpp

score 2 · Answer 2 · answered Nov 02 '18 at 09:13

You can calculate the max of c per group using groupby and transform, and then keep only the rows where c equals that max:

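# Broadcast each group's max of 'c' back onto its rows as a helper column
df['max_c'] = df.groupby('a')['c'].transform('max')
# Keep the rows at their group's max, drop the helper column, and store the result
res = df[df['c'] == df['max_c']].drop(columns='max_c')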
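
If adding a column to the original df is undesirable, the same idea can probably be written as a single chain that leaves df untouched. This is a sketch under the assumption that the question's data is used; assign, query and drop are standard DataFrame methods, and the name res is illustrative.

```python
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 7, 'b': 10, 'c': 2},
        {'a': 7, 'b': 2, 'c': 20}, {'a': 7, 'b': 2, 'c': 20}]
df = pd.DataFrame(data)

res = (
    # assign returns a new DataFrame, so df itself gains no column
    df.assign(max_c=df.groupby('a')['c'].transform('max'))
      # keep only the rows that sit at their group's max
      .query('c == max_c')
      # remove the temporary helper column again
      .drop(columns='max_c')
)

print(res)
print('max_c' in df.columns)  # False: the original frame is unchanged
```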

Franco Piccolo

Thank you. I replaced the second command with df = df.loc[df['c'] == df['max_c']] and then it works. – Tuan Anh Nov 02 '18 at 09:37
Welcome! Accept the answer if it solved the question. – Franco Piccolo Nov 02 '18 at 09:39
