0

Suppose we have a dataframe with the following format.

  1. There are three columns namely, Date, ID and State
  2. The ID may not be unique for every instance and may repeat. For example, 1, 2, 2, 1, 4, 5, 4, ...... etc.

Now, for a given row of this dataframe, we need to perform the following operation;

Find the maximum value of the variable State over all instances of the ID corresponding to that row over all future dates.

For example ID number 1 repeats at indices 10, 12, 40, 100 with dates 31/3/20, 30/6/20, 30/9/20 and 31/12/20 respectively. Now at index number 10, we need to find the maximum value of the variable State over the indices 10, 12, 40 and 100 because these indices come after the current date which is 31/3/20. At index 12, we need to find the max value of variable State over indices 12, 40 and 100 because they come after the current date which is 30/6/20 and so on.

The dataframe has over 8000 unique ids and more than 1 million observations so using a for loop over rows takes too much time. Is there a quick way to do this?

zee1231
  • 1
  • 1

1 Answers1

0
df = df.sort_values('Date', ascending=False).drop_duplicates(['ID'])

This gives you the the latest date for each ID.

pedro_bb7
  • 1,601
  • 3
  • 12
  • 28
  • ok so after looking at the latest date for each ID, how do I iterate over future dates to find the max value for that ID? Could you elaborate a little as I'm not so proficient in python? – zee1231 Jan 10 '22 at 13:52