How would you go about finding the latest date in a dataframe groupby?

Question

I am attempting to create a sub-table from a larger dataset which lists out the unique ID, the name of the person and the date they attended an appointment.

For example,

df = pd.DataFrame({'ID': ['abc', 'def', 'abc', 'abc'], 
                   'name':['Alex','Bertie','Alex','Alex'],
                   'date_attended':['01/01/2021','05/01/2021','11/01/2021','20/01/2021']

What I would like is a dataframe, that shows the last time Alex and Bertie attended a class. So my dataframe would like:

name      date_attended
Alex      20/01/2021
Bertie    05/01/2021

I'm really struggling with this! So far I have tried (based off a previous question I saw here):

max_date_list = ['ID','date_attended']
df = df.groupby(['ID'])[max_date_list].transform('max').size()

but I keep getting an error. I know this would involve a groupby but I can't figure out how to get the maximum date. Would anyone know how to do this?

does `df.groupby(['ID', 'name'])['date_attended'].max()` not work? — Matthew Barlowe, Sep 08 '21 at 16:30
The error I am getting is :numpy.int64 is not instance of int. — ds_1234, Sep 08 '21 at 16:40
Thank you! I changed the type date_attended was and the other suggestions worked! Thank you all! — ds_1234, Sep 08 '21 at 16:45

score 1 · Answer 1 · answered Sep 08 '21 at 16:32

To match your expected output format exactly, you might want to groupby "name":

>>> df.groupby("name")["date_attended"].max()
name
Alex      20/01/2021
Bertie    05/01/2021
Name: date_attended, dtype: object

Alternatively, if you might have different ID with the same name:

>>> df.groupby("ID").agg({"name": "first", "date_attended": "max"}).set_index("name")
       date_attended
name                
Alex      20/01/2021
Bertie    05/01/2021

score 1 · Accepted Answer · answered Sep 08 '21 at 16:37

Try sort_values by 'date_attended' and drop_duplicates by 'ID':

df['date_attended'] = pd.to_datetime(df['date_attended'], dayfirst=True)     
df.sort_values('date_attended', ascending=False).drop_duplicates('ID')

Output:

    ID    name date_attended
3  abc    Alex    2021-01-20
1  def  Bertie    2021-01-05

How would you go about finding the latest date in a dataframe groupby?

2 Answers2