0

I have a dataframe with ID and Date column, and need to filter the most recent dates for each id. However in this dataframe I have more than one row with the same date, and i need to keep it because the data in the other columns. The date is as object type.

When filtering using sort or keep='first' I can't make this separation work

df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
                   'info':['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
                   'date': ['2023-01-09', '2023-01-09', '2023-01-01', '2023-01-01', 
                            '2023-01-06', '2023-02-02','2023-01-10', '2023-02-02', 
                            '2023-01-10', '2022-12-20']})

df.sort_values(by=['date']).drop_duplicates(subset='id', keep='first')
id  info    date
1   a   2023-01-09
1   b   2023-01-09
1   c   2023-01-01
1   d   2023-01-01
1   e   2023-01-06
2   f   2023-02-02
2   g   2023-01-10
2   h   2023-02-02
2   i   2023-01-10
2   j   2022-12-20
id       int64
info    object
date    object
dtype: object

the result we need:

id  info    date
1   a   2023-01-09
1   b   2023-01-09
2   f   2023-02-02
2   h   2023-02-02

1 Answers1

0

Look in to pandas groupby. groupby date and id, sort by date, then take first index.

winderland
  • 362
  • 3
  • 7