Tricky obtain the latest date within unique groupings using Pandas

Question

Group by multiple columns and then only take the most recent date of the unique name value and all of the columns associated with it

Data

ID      name    size    stat    days    month   date        year
db11AA  cc      5       TRUE    10      June    6/1/2023    2023
db11AA  kj      9       FALSE   10      June    6/5/2023    2023
db11AA  cc      7       TRUE    10      June    6/2/2023    2023
db11AA  aa      2       TRUE    60      June    6/2/2023    2023
db22BB  bb      1       TRUE    10      June    6/30/2023   2023
db22BB  vl      2       FALSE   60      June    6/29/2023   2023
db11BB  ss      2       FALSE   10      April   4/2/2023    2023
db11BB  ss      2       FALSE   10      April   4/1/2023    2023
db67CC  la      1       FALSE   60      June    6/3/2024    2024
db67CC  la      0       FALSE   60      June    6/5/2024    2024
db11AA  cc      20      TRUE    10      May     5/1/2023    2024
db11AA  kj      30      FALSE   10      May     5/5/2023    2024

Desired

ID      name    size    stat    days    month   date        year
db11AA  cc      7       TRUE    10      June    6/2/2023    2023
db11AA  kj      9       FALSE   10      June    6/5/2023    2023
db11AA  aa      2       TRUE    60      June    6/2/2023    2023
db22BB  bb      1       TRUE    10      June    6/30/2023   2023
db22BB  vl      2       FALSE   60      June    6/29/2023   2023
db11BB  ss      2       FALSE   10      April   4/2/2023    2023
db67CC  la      0       FALSE   60      June    6/5/2024    2024
db11AA  cc      20      TRUE    10      May     5/1/2023    2024
db11AA  kj      30      FALSE   10      May     5/5/2023    2024

Logic: We can have duplicate ID's, but it is the name value that must be unique and showing the most recent date.

Doing

# Group the DataFrame by 'ID' and 'month' and select the row with the maximum 'size' value
df = df.groupby(['ID', 'month']).apply(lambda x: x.loc[x['date'].idxmax()])

I think I should use lambda not certain as the rows are still giving duplicates with the script above. Any suggestion is appreciated.

Can you please explain a little bit further? In your desired dataframe you have duplicate `db11AA` and `June` but you wrote that you want only the row with maximum `size` value. — Andrej Kesely, Jul 07 '23 at 23:06
yes @AndrejKesely because the name value differs. We can have duplicate ID's, but it is the name value that must be unique and showing the most recent date — Lynn, Jul 07 '23 at 23:30

score 1 · Accepted Answer · answered Jul 07 '23 at 23:39

IIUC, you can try:

df['date'] = pd.to_datetime(df['date'])

out = (
    df.groupby(["ID", "name", "month"], sort=False)
    .apply(lambda x: x.loc[x["date"].idxmax()])
    .reset_index(drop=True)
)
print(out)

Prints:

       ID name  size   stat  days  month       date  year
0  db11AA   cc     7   True    10   June 2023-06-02  2023
1  db11AA   kj     9  False    10   June 2023-06-05  2023
2  db11AA   aa     2   True    60   June 2023-06-02  2023
3  db22BB   bb     1   True    10   June 2023-06-30  2023
4  db22BB   vl     2  False    60   June 2023-06-29  2023
5  db11BB   ss     2  False    10  April 2023-04-02  2023
6  db67CC   la     0  False    60   June 2024-06-05  2024
7  db11AA   cc    20   True    10    May 2023-05-01  2024
8  db11AA   kj    30  False    10    May 2023-05-05  2024

ok will try - thank you for this solution – Lynn Jul 07 '23 at 23:45 — Lynn, Jul 07 '23 at 23:45

score 1 · Answer 2 · answered Jul 08 '23 at 05:05

You don't need to use apply here but you have to group by name too as I suggested in your previous (deleted) question:

idx = (df.assign(dt=pd.to_datetime(df['date']))
         .groupby(['name', 'ID', 'month'], sort=False)['dt']
         .idxmax())

filtered_df = df.loc[idx]

Output:

>>> filtered_df
        ID name  size   stat  days  month       date  year
2   db11AA   cc     7   True    10   June   6/2/2023  2023
1   db11AA   kj     9  False    10   June   6/5/2023  2023
3   db11AA   aa     2   True    60   June   6/2/2023  2023
4   db22BB   bb     1   True    10   June  6/30/2023  2023
5   db22BB   vl     2  False    60   June  6/29/2023  2023
6   db11BB   ss     2  False    10  April   4/2/2023  2023
9   db67CC   la     0  False    60   June   6/5/2024  2024
10  db11AA   cc    20   True    10    May   5/1/2023  2024
11  db11AA   kj    30  False    10    May   5/5/2023  2024

Tricky obtain the latest date within unique groupings using Pandas

2 Answers2