Keeping only rows with most recent date in dataframe

Question

I have a dataframe formatted like this in pandas.

(df)
School ID      Num          Status       Modified Date
School 1       6000         Active       2020-07-18
School 1       6000         InActive     2020-10-05
School 2       9-999        Active       2020-03-30
School 2       9-999        Active       2020-10-14
School 2       9-999        InActive     2020-07-21
School 3       7000         Active       2020-07-18
School 3       7000         InActive     2020-09-05

....

I am trying to create a function using sort() that will sort the rows that will only keep the rows with the most recent dates in the dataframe. So this would be the result.

(df)
School ID      Num          Status       Modified Date
School 1       6000         InActive     2020-10-05
School 2       9-999        Active       2020-10-14
School 3       7000         InActive     2020-09-05
....

I would like to use the sort() function, and then maybe drop duplicates of columns Num and Status, but I am a bit stuck. Thanks.

score 1 · Accepted Answer · answered Oct 28 '20 at 13:40

This can be done by sort_values & drop_duplicates:

df = df.sort_values(by=['Modified Date'], ascending=False)
df = drop_duplicates(subset='School ID', keep='first)

Where the sort ensures that for each school the newest date will appear first, and the drop duplicates takes the first appearance of each school, which is the newest.

score 1 · Answer 2 · answered Oct 28 '20 at 13:46

Here's another way using groupby:

df['Modified Date'] = pd.to_datetime(df['Modified Date'])
df.sort_values('Modified Date', inplace=True, ascending=False)
df = df.groupby(by=['School ID'], as_index=False).first()
print(df)

  School ID    Num    Status Modified Date
0  School 1   6000  InActive    2020-10-05
1  School 2  9-999    Active    2020-10-14
2  School 3   7000  InActive    2020-09-05

Keeping only rows with most recent date in dataframe

2 Answers2