Group by multiple columns, then keep only the row with the most recent date for each unique name value (along with all of that row's columns)
Data
ID      name  size  stat   days  month  date       year
db11AA  cc    5     TRUE   10    June   6/1/2023   2023
db11AA  kj    9     FALSE  10    June   6/5/2023   2023
db11AA  cc    7     TRUE   10    June   6/2/2023   2023
db11AA  aa    2     TRUE   60    June   6/2/2023   2023
db22BB  bb    1     TRUE   10    June   6/30/2023  2023
db22BB  vl    2     FALSE  60    June   6/29/2023  2023
db11BB  ss    2     FALSE  10    April  4/2/2023   2023
db11BB  ss    2     FALSE  10    April  4/1/2023   2023
db67CC  la    1     FALSE  60    June   6/3/2024   2024
db67CC  la    0     FALSE  60    June   6/5/2024   2024
db11AA  cc    20    TRUE   10    May    5/1/2023   2024
db11AA  kj    30    FALSE  10    May    5/5/2023   2024
Desired
ID      name  size  stat   days  month  date       year
db11AA  cc    7     TRUE   10    June   6/2/2023   2023
db11AA  kj    9     FALSE  10    June   6/5/2023   2023
db11AA  aa    2     TRUE   60    June   6/2/2023   2023
db22BB  bb    1     TRUE   10    June   6/30/2023  2023
db22BB  vl    2     FALSE  60    June   6/29/2023  2023
db11BB  ss    2     FALSE  10    April  4/2/2023   2023
db67CC  la    0     FALSE  60    June   6/5/2024   2024
db11AA  cc    20    TRUE   10    May    5/1/2023   2024
db11AA  kj    30    FALSE  10    May    5/5/2023   2024
Logic: We can have duplicate IDs, but the name value must be unique within each group (per ID and year, as shown in the desired output), keeping the row with the most recent date.
Doing
# Group the DataFrame by 'ID' and 'month' and select the row with the maximum 'date' value
df = df.groupby(['ID', 'month']).apply(lambda x: x.loc[x['date'].idxmax()])
I think I should use lambda, but I'm not certain, because the script above still leaves duplicate rows. Any suggestion is appreciated.
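One possible explanation: the attempt groups by ['ID', 'month'], so duplicate name values within a group survive, and since 'date' is stored as a string, idxmax compares it lexicographically rather than chronologically. A sketch of a fix, assuming the grouping keys should be ['ID', 'name', 'year'] (inferred from the desired output, where db11AA/cc appears once per year) and that dates are in month/day/year format:

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    "ID":   ["db11AA", "db11AA", "db11AA", "db11AA", "db22BB", "db22BB",
             "db11BB", "db11BB", "db67CC", "db67CC", "db11AA", "db11AA"],
    "name": ["cc", "kj", "cc", "aa", "bb", "vl", "ss", "ss", "la", "la", "cc", "kj"],
    "size": [5, 9, 7, 2, 1, 2, 2, 2, 1, 0, 20, 30],
    "stat": [True, False, True, True, True, False, False, False, False, False, True, False],
    "days": [10, 10, 10, 60, 10, 60, 10, 10, 60, 60, 10, 10],
    "month": ["June", "June", "June", "June", "June", "June", "April", "April",
              "June", "June", "May", "May"],
    "date": ["6/1/2023", "6/5/2023", "6/2/2023", "6/2/2023", "6/30/2023",
             "6/29/2023", "4/2/2023", "4/1/2023", "6/3/2024", "6/5/2024",
             "5/1/2023", "5/5/2023"],
    "year": [2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
})

# Parse 'date' so comparisons are chronological, not lexicographic
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# For each (ID, name, year) group, keep the index of the latest date,
# then select those rows — no lambda/apply needed
out = df.loc[df.groupby(["ID", "name", "year"])["date"].idxmax()]
out = out.sort_index()  # optional: restore the original row order
```

Using `groupby(...)["date"].idxmax()` with `df.loc[...]` avoids `apply` entirely and keeps all columns of the winning row; on the sample data it yields the nine rows of the desired output.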