0

I am attempting to create a sub-table from a larger dataset which lists out the unique ID, the name of the person and the date they attended an appointment.

For example,

df = pd.DataFrame({'ID': ['abc', 'def', 'abc', 'abc'], 
                   'name':['Alex','Bertie','Alex','Alex'],
                   'date_attended':['01/01/2021','05/01/2021','11/01/2021','20/01/2021']

What I would like is a dataframe, that shows the last time Alex and Bertie attended a class. So my dataframe would like:

name      date_attended
Alex      20/01/2021
Bertie    05/01/2021

I'm really struggling with this! So far I have tried (based off a previous question I saw here):

max_date_list = ['ID','date_attended']
df = df.groupby(['ID'])[max_date_list].transform('max').size()

but I keep getting an error. I know this would involve a groupby but I can't figure out how to get the maximum date. Would anyone know how to do this?

ds_1234
  • 57
  • 5

2 Answers2

1

To match your expected output format exactly, you might want to groupby "name":

>>> df.groupby("name")["date_attended"].max()
name
Alex      20/01/2021
Bertie    05/01/2021
Name: date_attended, dtype: object

Alternatively, if you might have different ID with the same name:

>>> df.groupby("ID").agg({"name": "first", "date_attended": "max"}).set_index("name")
       date_attended
name                
Alex      20/01/2021
Bertie    05/01/2021
not_speshal
  • 22,093
  • 2
  • 15
  • 30
1

Try sort_values by 'date_attended' and drop_duplicates by 'ID':

df['date_attended'] = pd.to_datetime(df['date_attended'], dayfirst=True)     
df.sort_values('date_attended', ascending=False).drop_duplicates('ID')

Output:

    ID    name date_attended
3  abc    Alex    2021-01-20
1  def  Bertie    2021-01-05
Scott Boston
  • 147,308
  • 15
  • 139
  • 187