Clean duplicates based on multiple conditions

Question

I have a df of fruit purchases sorted by date. I want to drop duplicates by fruit. But the way to drop duplicates depend on the column. The solution needs to generalise to more columns. But the 3 types of operations remain the same:

For each fruit:

price column should be the highest sold price
date, place and colour columns should be the most recent value that isn't NaN
qty should be the average number sold

df fruit      date   price  place   colour   qty
0  Apple  25-12-2023   4     NaN    Green    5
1  Apple  22-11-2023   5    London   Red     6 
2  Apple  20-10-2023   6    Paris    NaN     8 
3  Pear   19-10-2023   4    Sweden   Red     8
4  Pear   18-10-2023   5    London   Green   8
5  Pear   17-10-2023   10   Paris   Purple   9

Expected Output:

   fruit     date       price   place   colour   qty
   Apple   25-12-2023    6     London   Green    6.33 (5+6=8/3)
   Pear    19-10-2023    10    Sweden   Red      8.33 (8+8+9/3)

The logic is unclear, can you have multiple conditions? Please provide a complete example without `...` and the exact matching expected output — mozway, Jul 27 '23 at 14:36
Try to breakdown the problem into parts. Solve one condition at a time. @asd — Vishnudev Krishnadas, Jul 27 '23 at 14:37
OK, then it's a well known duplicate, use `groupby.idxmax`/`idxmin` — mozway, Jul 27 '23 at 14:41
You can sort it first based on price and then you can remove duplicate ones by keeping the first occurrence df = pd.DataFrame(data) df = df.sort_values(by='price', ascending=False) duplicate = df[df.duplicated('fruit')] df = df.drop_duplicates('fruit', keep="first") print(df) — Dayananda D R, Jul 27 '23 at 14:57
@asd then the logic is unclear, please clarify and improve the example — mozway, Jul 27 '23 at 14:59
@mozway, sorry it was unclear, I have rewritten the whole q. — asd, Jul 27 '23 at 21:10
Assuming the dates are already sorted, use a groupby.agg with different aggregation functions (min/max/first/first/mean) — mozway, Jul 28 '23 at 04:02

Clean duplicates based on multiple conditions

0 Answers0