0

Same question as here: group by pandas dataframe and select latest in each group, except instead of latest date, would like to get next upcoming date for each group.

So given a dataframe sorted by date:

 id     product  date
0   220    6647     2020-09-01 
1   220    6647     2020-10-03 
2   220    6647     2020-12-16
3   826    3380     2020-11-11
4   826    3380     2020-12-09
5   826    3380     2021-05-19
6   901    4555     2020-09-01
7   901    4555     2020-12-01
8   901    4555     2021-11-01

Using todays date (2020-12-01) to determine the next upcoming date, grouping by id or product and selecting the the next upcoming date should give:

    id     product   date
2   220    6647     2020-12-16
5   826    3380     2020-12-09
8   901    4555     2021-11-01
DataScope
  • 29
  • 5

1 Answers1

0

Filter the dates first, then drop duplicates:

df[df['date']>'2020-12-01'].sort_values(['id','date']).drop_duplicates('id')

Output:

    id  product        date
2  220     6647  2020-12-16
4  826     3380  2020-12-09
8  901     4555  2021-11-01
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thank you. Can it be done without hard coding today's date? – DataScope Dec 01 '20 at 18:01
  • Use `pd.Timestamp.now().normalize()` instead of hard coding. Make sure your `date` column is datetime type. – Quang Hoang Dec 01 '20 at 18:02
  • this does not seem to be working. with hardcoding or with pd.timestamp.now().normalize(). no errors, just not dropping duplicates – DataScope Dec 01 '20 at 19:42
  • Did you forget to assign back? – Quang Hoang Dec 01 '20 at 19:58
  • yes! that fixed it. thank you. should note that -- df[df['date']>'2020-12-01'].sort_values(['id','date']).drop_duplicates('id') -- will delete any rows that have a blank field in the date column. i can fix that but just want to give a heads up for anyone else looking at this page. – DataScope Dec 01 '20 at 20:13