0

I have the below dataframe called df:

Id Stage1 Stage2 Stage3
1 2022-02-01 2020-04-03 2022-06-07
--- ------------ ------------ -----------
2 2023-06-07 2020-03-01 2020-09-03
--- ------------ ------------ -----------
3 2023-02-04 2023-06-07 2022-06-07
--- ------------ ------------ -----------
4 2022-05-08 2023-09-01 2023-09-01

I need to calculate the max date for each ID and its respective Stage. So for Order 1,2,3 the Stages I need are Stage 3, Stage 1 and Stage 2 respectively. I started this process by calculating the max date in each row first with the below code:

df2 = df[['Stage1', 'Stage2', 'Stage3', 'Stage4', 'Stage5']]
lis = list(df2.max(axis=1))

lis variable has the max dates stored for each row. Now, with each max date, I need to get the Stage Name of that row.

The below code calculates the max Stage for the whole df and not row.

new_lis = []
for i in lis:
    new_lis.append(df.columns[df.isin([i]).any()])

How do I fix this? Output I need is "Stage 3", "Stage 1" and "Stage 2" for Order 1,2,3 respectively.

Sriya TR
  • 29
  • 7

1 Answers1

3

Let's try idxmax(axis=1)

out = (df.filter(like='Stage')
       .apply(pd.to_datetime)
       .idxmax(axis=1))
print(out)

0    Stage3
1    Stage1
2    Stage2
dtype: object

If your stage columns contain NaT for the whole row, you can drop this row

out = (df.filter(like='Stage')
       .apply(pd.to_datetime)
       .dropna(how='all')
       .idxmax(axis=1))
input dataframe

   Id      Stage1      Stage2      Stage3
0   1  2022-02-01  2020-04-03  2022-06-07
1   2  2023-06-07  2020-03-01  2020-09-03
2   3  2023-02-04  2023-06-07  2022-06-07
3   4         NaN         NaN         NaN
4   5         NaT  2023-06-07  2022-06-07

output dataframe, note the index 3 is dropped

0    Stage3
1    Stage1
2    Stage2
4    Stage2
dtype: object
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
  • Thank you so much! For some Id's, I have null values for all Stages. For these rows, the code you provided I am getting this error: ValueError: attempt to get argmax of an empty sequence. How do I fix this? – Sriya TR Sep 19 '22 at 17:45
  • @SriyaTR How is your null value like? – Ynjxsjmh Sep 19 '22 at 17:52
  • Null values have this value "NaT" – Sriya TR Sep 19 '22 at 17:53
  • @SriyaTR I don't think NaT will be the cause of your error. NaT is NaN for datetime. Consider provide the raw line. – Ynjxsjmh Sep 19 '22 at 17:55
  • @SriyaTR I understand the error, what do you expect to get for the whole NaT rows? – Ynjxsjmh Sep 19 '22 at 18:04
  • Hi, I have a quick question. For Order 4, I have the same date for Stage 2 and 3. Using the code you provided, I am getting the answer as Stage 2 as its the first occurrence of the max date but I need Stage 3 as my answer as its the latest stage. How would this be possible? – Sriya TR Sep 20 '22 at 10:03
  • @SriyaTR Maybe you can try https://stackoverflow.com/a/62001586/10315163 – Ynjxsjmh Sep 20 '22 at 10:09
  • df[::-1].idxmax() just reverses the dataframe. It puts the first index at the bottom of the dataframe and the last index at the top of the df – Sriya TR Sep 20 '22 at 10:18
  • @SriyaTR You can use `df.filter(like='Stage')[lambda df: df.columns[::-1]]` to reverse columns oder. – Ynjxsjmh Sep 20 '22 at 10:33