0

I have a dataframe and I want to find the duplicates based on Color and Price. Then change the code with the code of the most recent(based on the Date) row. I don't want any row to be deleted. The dataframe is this:

id   Color   Price   Code       Date 
1   White    1.50     111   3-22-2017 12:00
2   Green    2.20     222   3-23-2017 09:55
3   Black    3.00     333   3-24-2017 11:45
4   White    1.50     111   3-23-2017 10:20
5   White    1.50     444   3-23-2017 08:15
6   Green    2.20     555   3-25-2017 07:05

the result should be this:

   id   Color   Price   Code       Date 
    1   White    1.50     111   3-22-2017 12:00
    2   Green    2.20     **555**   3-23-2017 09:55
    3   Black    3.00     333   3-24-2017 11:45
    4   White    1.50     111   3-23-2017 10:20
    5   White    1.50     **111**   3-23-2017 08:15
    6   Green    2.20     555   3-25-2017 07:05

I know that the answer is close to the answer here but the form of the date confuses me.

cs95
  • 379,657
  • 97
  • 704
  • 746
dimosbele
  • 381
  • 3
  • 19

3 Answers3

3

Try using df.groupby followed by dfGroupBy.transform to pick the last value:

In [406]: df.Code = df.sort_values('Date')\
                      .groupby(['Color', 'Price']).Code.transform('last') # faster than lambda 

In [407]: df
Out[407]: 
   id  Color  Price  Code             Date
0   1  White    1.5   111  3-22-2017 12:00
4   5  White    1.5   111  3-23-2017 08:15
1   2  Green    2.2   555  3-23-2017 09:55
3   4  White    1.5   111  3-23-2017 10:20
2   3  Black    3.0   333  3-24-2017 11:45
5   6  Green    2.2   555  3-25-2017 07:05

Make sure your dataframe is sorted by Date, by calling df.sort_values first.

Also, as jezrael pointed out, I think your expected output is incorrect. To get the correct output, you must ensure that Date is of pd.datetime type. Convert it if it isn't.

cs95
  • 379,657
  • 97
  • 704
  • 746
2

For completeness, here is a solution that doesn't require sorting by date. You can get the most recent row for each color-price pair by

key = [df.Color, df.Price]; 
most_recent = df.groupby(key).Date.transform(max)

Then do

code = df.Code.where(df.Date == most_recent).groupby(key).transform(max)
JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • I would argue passing `lambda` to `transform` isn't a good idea. This solution is around 2x faster even when the data is already sorted by date. I chose to solution such that the data wasn't sorted by date. – JoeCondron Aug 17 '17 at 14:05
2

I think you need sort_values with GroupBy.transform and function last:

#if not datetime, convert it
df['Date'] = pd.to_datetime(df['Date'])

df['Code'] = df.sort_values('Date').groupby(['Color', 'Price'])['Code'].transform('last')
print (df)
   id  Color  Price  Code                Date
0   1  White    1.5   111 2017-03-22 12:00:00
1   2  Green    2.2   555 2017-03-23 09:55:00
2   3  Black    3.0   333 2017-03-24 11:45:00
3   4  White    1.5   111 2017-03-23 10:20:00
4   5  White    1.5   111 2017-03-23 08:15:00
5   6  Green    2.2   555 2017-03-25 07:05:00
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252