
How to group values of a pandas dataframe and select the latest (by date) from each group?

For example, given a dataframe sorted by date:

    id     product   date
0   220    6647     2014-09-01 
1   220    6647     2014-09-03 
2   220    6647     2014-10-16
3   826    3380     2014-11-11
4   826    3380     2014-12-09
5   826    3380     2015-05-19
6   901    4555     2014-09-01
7   901    4555     2014-10-05
8   901    4555     2014-11-01

Grouping by id or product, and selecting the latest gives:

    id     product   date
2   220    6647     2014-10-16
5   826    3380     2015-05-19
8   901    4555     2014-11-01
piRSquared
DevEx

6 Answers


You can also use `tail` with `groupby` to get the last n rows of each group:

df.sort_values('date').groupby('id').tail(1)

    id  product date
2   220 6647    2014-10-16
8   901 4555    2014-11-01
5   826 3380    2015-05-19
nipy
    I like this because it can be applied to more than just dates. – scottlittle Feb 14 '18 at 16:28
    This option is significantly faster, than the accepted answer, but is less readable. Also isn't it a problematic, that there is an assumption that `groupby` preserves order? – Michael D Jan 01 '20 at 14:54
    groupby preserves order, see https://stackoverflow.com/questions/26456125/python-pandas-is-order-preserved-when-using-groupby-and-agg – Martien Lubberink Apr 12 '20 at 02:13
  • @ade1e how would the code change to perform a resample (say per month or year) and keep the last n values of the group, rather than summing/averaging? – Andreuccio Oct 07 '20 at 13:14
    I find this answer much more readable than the accepted one @MichaelD :) – Mr_and_Mrs_D Dec 21 '21 at 12:53
  • For haters of groupby: `df.sort_values('date').drop_duplicates('id', keep='last')` – Alex Li Aug 04 '23 at 19:19
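A minimal sketch of the `tail` approach, rebuilding the sample frame from the question; as the answer notes, `tail(n)` generalizes to the last n rows per group:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id': [220, 220, 220, 826, 826, 826, 901, 901, 901],
    'product': [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    'date': pd.to_datetime([
        '2014-09-01', '2014-09-03', '2014-10-16',
        '2014-11-11', '2014-12-09', '2015-05-19',
        '2014-09-01', '2014-10-05', '2014-11-01',
    ]),
})

# last (latest) row per id
latest = df.sort_values('date').groupby('id').tail(1)

# tail(n) keeps the last n rows of each group
latest_two = df.sort_values('date').groupby('id').tail(2)
```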

Use `idxmax` on the grouped date column and slice the dataframe with `loc`:

df.loc[df.groupby('id').date.idxmax()]

    id  product       date
2  220     6647 2014-10-16
5  826     3380 2015-05-19
8  901     4555 2014-11-01
piRSquared

I had a similar problem and ended up using drop_duplicates rather than groupby.

It seems to run significantly faster on large datasets than the other methods suggested above.

df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19
Damien Marlier
    I typically use this as well, but wish the faster solution was with the groupby. The groupby intuitively makes more sense and is usually how we think about solving this problem! – rmilletich Dec 10 '19 at 06:44
  • This approach, however, only works if you want to keep 1 record per group, rather than N records when using `tail` as per @nipy's answer – npetrov937 Dec 05 '22 at 09:42
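The ordering caveat matters here: `drop_duplicates` with `keep="last"` keeps the last occurrence in row order, so the sort must come first. A small sketch with values taken from the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [220, 220, 826, 826],
    'date': pd.to_datetime(['2014-10-16', '2014-09-01',
                            '2015-05-19', '2014-11-11']),
})

# without sorting, keep="last" keeps the last row *seen* per id,
# which is not necessarily the latest date
unsorted_pick = df.drop_duplicates(subset=['id'], keep='last')

# sorting by date first makes "last" mean "latest"
latest = df.sort_values(by='date').drop_duplicates(subset=['id'], keep='last')
```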

Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:

Like this:

df.groupby(['id','product']).last()

like this:

df.groupby(['id','product']).nth(-1)

or like this:

df.groupby(['id','product']).max()

If you don't want id and product to appear as the index, use `groupby(['id', 'product'], as_index=False)`. Alternatively, use:

df.groupby(['id','product']).tail(1)
Sandu Ursu
    In my tests, last() behaves a bit differently than nth(), when there are None values in the same column. For example, if first row in a group has the value 1 and the rest of the rows in the same group all have None, last() will return 1 as the value, although the last row has None. On the other hand nth(-1) will return None, which is more like what I expect. – Canol Gökel Oct 05 '21 at 15:25
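Canol Gökel's observation can be reproduced with a minimal sketch (values assumed for illustration): `last()` skips nulls within the group, while `nth(-1)` takes the positionally last row even if it is null.

```python
import pandas as pd

# one group where only the first row has a value
df = pd.DataFrame({
    'id': [220, 220, 220],
    'value': [1.0, None, None],
})

g = df.groupby('id')['value']

last_val = g.last()   # skips nulls within the group
nth_val = g.nth(-1)   # positionally last row, null included
```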

To use .tail() as an aggregation method and keep your grouping intact:

df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))

        id  product date
id              
220 2   220 6647    2014-10-16
826 5   826 3380    2015-05-19
901 8   901 4555    2014-11-01
Kristin Q
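If the duplicated id index level in the output above is unwanted, passing `group_keys=False` to `groupby` (a standard pandas parameter) keeps the original flat index instead. A sketch with assumed sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [220, 220, 826, 826],
    'product': [6647, 6647, 3380, 3380],
    'date': pd.to_datetime(['2014-09-01', '2014-10-16',
                            '2014-11-11', '2015-05-19']),
})

# group_keys=False stops apply from prepending the group key
# as an extra index level
latest = (df.sort_values('date')
            .groupby('id', group_keys=False)
            .apply(lambda x: x.tail(1)))
```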
# import the datetime library
from datetime import datetime as dt

# transform the date column to ordinal (or create a temp column holding the ordinal)
df['date'] = df.date.apply(lambda date: date.toordinal())

# apply an aggregation function depending on whether you want the earliest or latest date
latest_date = df.groupby('id').agg(latest=('date', max))
earliest_date = df.groupby('id').agg(earliest=('date', min))

# convert from ordinal back to date
df['date'] = df.date.apply(lambda date: dt.fromordinal(date))

# This operation may take seconds on millions of records.
navarro
  • Well, you certainly don't need to perform this conversion just to find the latest or earliest date. – AlexK Sep 30 '22 at 19:49
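As the comment notes, the ordinal round-trip is unnecessary: pandas aggregates datetime columns directly. A minimal sketch with assumed sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [220, 220, 826],
    'date': pd.to_datetime(['2014-09-01', '2014-10-16', '2015-05-19']),
})

# named aggregation works on the datetime column as-is
latest_date = df.groupby('id').agg(latest=('date', 'max'))
earliest_date = df.groupby('id').agg(earliest=('date', 'min'))
```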