Pandas: Find duplicates and modify them based on date

Question

I have a dataframe and I want to find the duplicates based on Color and Price. Then change the code with the code of the most recent(based on the Date) row. I don't want any row to be deleted. The dataframe is this:

id   Color   Price   Code       Date 
1   White    1.50     111   3-22-2017 12:00
2   Green    2.20     222   3-23-2017 09:55
3   Black    3.00     333   3-24-2017 11:45
4   White    1.50     111   3-23-2017 10:20
5   White    1.50     444   3-23-2017 08:15
6   Green    2.20     555   3-25-2017 07:05

the result should be this:

   id   Color   Price   Code       Date 
    1   White    1.50     111   3-22-2017 12:00
    2   Green    2.20     **555**   3-23-2017 09:55
    3   Black    3.00     333   3-24-2017 11:45
    4   White    1.50     111   3-23-2017 10:20
    5   White    1.50     **111**   3-23-2017 08:15
    6   Green    2.20     555   3-25-2017 07:05

I know that the answer is close to the answer here but the form of the date confuses me.

I think need `111` instead `444`, because more resent date is `3-23-2017 10:20` in `white, 1.5` group — jezrael, Aug 17 '17 at 12:21

cs95 · Answer 1 · 2017-08-17T14:07:04.793

Try using df.groupby followed by dfGroupBy.transform to pick the last value:

In [406]: df.Code = df.sort_values('Date')\
                      .groupby(['Color', 'Price']).Code.transform('last') # faster than lambda 

In [407]: df
Out[407]: 
   id  Color  Price  Code             Date
0   1  White    1.5   111  3-22-2017 12:00
4   5  White    1.5   111  3-23-2017 08:15
1   2  Green    2.2   555  3-23-2017 09:55
3   4  White    1.5   111  3-23-2017 10:20
2   3  Black    3.0   333  3-24-2017 11:45
5   6  Green    2.2   555  3-25-2017 07:05

Make sure your dataframe is sorted by Date, by calling df.sort_values first.

Also, as jezrael pointed out, I think your expected output is incorrect. To get the correct output, you must ensure that Date is of pd.datetime type. Convert it if it isn't.

JoeCondron · Answer 2 · 2017-08-17T14:44:14.387

2

For completeness, here is a solution that doesn't require sorting by date. You can get the most recent row for each color-price pair by

key = [df.Color, df.Price]; 
most_recent = df.groupby(key).Date.transform(max)

Then do

code = df.Code.where(df.Date == most_recent).groupby(key).transform(max)

edited Aug 17 '17 at 14:44

answered Aug 17 '17 at 12:16

JoeCondron

8,546
3
27
28

I would argue passing `lambda` to `transform` isn't a good idea. This solution is around 2x faster even when the data is already sorted by date. I chose to solution such that the data wasn't sorted by date. – JoeCondron Aug 17 '17 at 14:05

jezrael · Accepted Answer · 2017-08-17T12:35:54.663

I think you need sort_values with GroupBy.transform and function last:

#if not datetime, convert it
df['Date'] = pd.to_datetime(df['Date'])

df['Code'] = df.sort_values('Date').groupby(['Color', 'Price'])['Code'].transform('last')
print (df)
   id  Color  Price  Code                Date
0   1  White    1.5   111 2017-03-22 12:00:00
1   2  Green    2.2   555 2017-03-23 09:55:00
2   3  Black    3.0   333 2017-03-24 11:45:00
3   4  White    1.5   111 2017-03-23 10:20:00
4   5  White    1.5   111 2017-03-23 08:15:00
5   6  Green    2.2   555 2017-03-25 07:05:00

Pandas: Find duplicates and modify them based on date

3 Answers3