-1

Let's suppose I have a dataset like this:

item_id |    date    | cat |
----------------------------
   0    | 2020-01-01 |  A  |
   0    | 2020-02-01 |  B  |
   1    | 2020-04-01 |  A  |
   2    | 2020-02-01 |  C  |
   2    | 2021-01-01 |  B  |

So, I need to get the last category (column cat), that means that the result dataframe would be the following:

item_id | cat |
---------------
   0    |  B  |
   1    |  A  |
   2    |  B  |

I know I could sort the values by the date and then iterate over the itens, but that would be too much consuming. Is there another method on pandas to achieve that?

Gabriel Caldas
  • 371
  • 2
  • 13

1 Answers1

3

Use drop_duplicates after sort_values:

>>> df.sort_values('date').drop_duplicates('item_id', keep='last')
   item_id        date cat
1        0  2020-02-01   B
2        1  2020-04-01   A
4        2  2021-01-01   B

Comment by @mozway:

Sorting is O(n*logn)

>>> df.loc[pd.to_datetime(df['date']).groupby(df['item_id'], sort=False).idxmax()]

   item_id        date cat
1        0  2020-02-01   B
2        1  2020-04-01   A
4        2  2021-01-01   B
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Sorting is `O(n*logn)`, use `df.loc[pd.to_datetime(df['date']).groupby(df['item_id'], sort=False).idxmax()]` ;) – mozway Feb 02 '23 at 17:18
  • `idxmax` doesn't work on string, `sort_values` works on both :-P I'm kidding :-) – Corralien Feb 02 '23 at 17:30
  • Yes it does! `df.loc[df['date'].astype('string').groupby(df['item_id'], sort=False).idxmax()]` or `df.loc[df.convert_dtypes().groupby('item_id', sort=False)['date'].idxmax()]` :p – mozway Feb 02 '23 at 17:54