How to remove duplicates across two columns in a DataFrame, keeping the row with the lowest value in a third column?

Question

For example, I have the DataFrame:

import pandas as pd

a = [{'column_1': 'A', 'column_2': 'B', 'column_3': 20.14}, {'column_1': 'A', 'column_2': 'B', 'column_3': 20.35}]
df = pd.DataFrame(a)

I need to drop the duplicates based on two columns, i.e. df.drop_duplicates(['column_1', 'column_2']), but subject to the following condition.

First I need to compare the values in df['column_3'] and keep the entry with the lower value, in this case 20.14.

There may be more than two duplicates in a real table.

Scott Boston · Answer 1 · 2023-01-18T18:53:32.167

2

Sort the DataFrame first using sort_values, then call drop_duplicates, which by default keeps the first record in each group (here, the one with the lowest column_3 value).

df.sort_values(['column_3']).drop_duplicates(['column_1', 'column_2'])
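To see how this behaves with more than two duplicates, here is a minimal sketch. The sample rows beyond the question's two, and the final sort_index() call, are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample data: three duplicates of (A, B) plus one (C, D) row.
df = pd.DataFrame({
    'column_1': ['A', 'A', 'A', 'C'],
    'column_2': ['B', 'B', 'B', 'D'],
    'column_3': [20.35, 20.14, 21.00, 5.50],
})

# Sort ascending so the smallest column_3 comes first,
# then keep the first row seen for each (column_1, column_2) pair.
result = (
    df.sort_values('column_3')
      .drop_duplicates(['column_1', 'column_2'], keep='first')
      .sort_index()  # optional: restore the original row order
)
print(result)
```

Because keep='first' is the default, passing it explicitly only documents the intent.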

Another way, which keeps every row tied for the minimum within a group:

df[df['column_3'] == df.groupby(['column_1', 'column_2'])['column_3'].transform('min')]
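A small sketch of the tie behaviour, assuming hypothetical data in which two rows share the minimum value: transform('min') broadcasts each group's minimum back onto every row, so the comparison keeps all rows that equal it.

```python
import pandas as pd

# Hypothetical data: two rows share the minimum 20.14 for the (A, B) group.
df = pd.DataFrame({
    'column_1': ['A', 'A', 'A'],
    'column_2': ['B', 'B', 'B'],
    'column_3': [20.14, 20.14, 20.35],
})

# Per-row copy of the group's minimum, aligned with df's index.
group_min = df.groupby(['column_1', 'column_2'])['column_3'].transform('min')

# Boolean mask keeps every row whose value equals its group's minimum.
tied = df[df['column_3'] == group_min]
print(tied)
```

Both 20.14 rows survive here, whereas the sort_values plus drop_duplicates approach would keep only one of them.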

Or, if you want just one record per group:

df.groupby(['column_1', 'column_2'], as_index=False)['column_3'].min()
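Note that this returns only the grouping columns and column_3. If the real table has further columns that should be kept, one option (not part of the original answer) is idxmin, which also avoids sorting. A minimal sketch, assuming a hypothetical extra column 'note' and a unique index:

```python
import pandas as pd

df = pd.DataFrame({
    'column_1': ['A', 'A', 'C'],
    'column_2': ['B', 'B', 'D'],
    'column_3': [20.35, 20.14, 5.50],
    'note': ['second', 'first', 'only'],  # hypothetical extra column
})

# For each group, idxmin gives the index label of the row with the smallest column_3.
min_idx = df.groupby(['column_1', 'column_2'])['column_3'].idxmin()

# Select those rows in full, so the extra column is preserved.
whole_rows = df.loc[min_idx]
print(whole_rows)
```

On ties, idxmin returns the first occurrence, so exactly one row per group is kept.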

edited Jan 18 '23 at 18:53

answered Jan 18 '23 at 18:43

Scott Boston

Thanks for the answer, it works. But how can this be done without sort_values? Can it be done some other way? – LiAfe Jan 18 '23 at 18:49

score 2 · Answer 2 · answered Jan 18 '23 at 18:52


You can group by 'column_1' and 'column_2' and then take the min of column_3.

df.groupby(['column_1', 'column_2'])['column_3'].min().to_frame().reset_index()

Output:

  column_1 column_2  column_3
0        A        B     20.14


I'mahdi
