2

Below is a sample data frame:

df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass']})

  StudentName    ExamDate Result
0        Anil  2021-01-10   Fail
1        Ramu  2021-01-20   Pass
2        Ramu  2021-02-22   Fail
3        Anil  2021-03-30   Pass
4       Peter  2021-01-04   Pass
5       Peter  2021-06-06   Pass
6        Anil  2021-04-30   Pass
7        Ramu  2021-07-30   Pass
8       Peter  2021-07-08   Fail
9        Anil  2021-09-07   Pass

For each row, I would like to calculate the number of days it has been since that student's last failed test:

df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass'],
                   'LastFailedDays': [0, 0, 0, 79, 0, 0, 110, 158, 0, 240]})

  StudentName    ExamDate Result  LastFailedDays
0        Anil  2021-01-10   Fail               0
1        Ramu  2021-01-20   Pass               0
2        Ramu  2021-02-22   Fail               0
3        Anil  2021-03-30   Pass              79
4       Peter  2021-01-04   Pass               0
5       Peter  2021-06-06   Pass               0
6        Anil  2021-04-30   Pass             110
7        Ramu  2021-07-30   Pass             158
8       Peter  2021-07-08   Fail               0
9        Anil  2021-09-07   Pass             240

For example:

  • Anil failed on 2021-01-10, so for that row it will be zero days.
  • Anil's next record, which is successful, is on 2021-03-30, so the number of days for that row will be the number of days from his previous failed date 2021-01-10 to 2021-03-30, which is 79 days.
  • Anil's third record, which is also successful, is on 2021-04-30, so the number of days there will be again, the number of days 2021-01-10 (his last failed date) to 2021-04-30, which is 110 days.

It is doable with regular loops but I am looking for a more conventional Pandas solution. I'm guessing it's possible with groupby.

tdy
  • 36,675
  • 19
  • 86
  • 83
Amey Kumar Samala
  • 904
  • 1
  • 7
  • 20
  • cannot understand how you calculate LastFailedDays. anil for example failed in 21-01-10, 21-03-30, 21-04-30 and 21-09-07 how does this last day counts 240 days? something like days since first failing? – Ulises Bussi Nov 18 '21 at 17:24
  • I've explained the scenario with an example. @UlisesBussi – Amey Kumar Samala Nov 18 '21 at 17:49

2 Answers2

2

TL;DR

Use Series.where and groupby.ffill to generate each student's last failed date and subtract from ExamDate to get LastFailedDays:

df['ExamDate'] = pd.to_datetime(df['ExamDate'])

df['LastFailedDays'] = (df['ExamDate'].sub(
    df['ExamDate'].where(df['Result'] == 'Fail').groupby(df['StudentName']).ffill()
).dt.days.fillna(0))

#   StudentName    ExamDate  Result  LastFailedDays
# 0        Anil  2021-01-10    Fail             0.0
# 1        Ramu  2021-01-20    Pass             0.0
# 2        Ramu  2021-02-22    Fail             0.0
# 3        Anil  2021-03-30    Pass            79.0
# 4       Peter  2021-01-04    Pass             0.0
# 5       Peter  2021-06-06    Pass             0.0
# 6        Anil  2021-04-30    Pass           110.0
# 7        Ramu  2021-07-30    Pass           158.0
# 8       Peter  2021-07-08    Fail             0.0
# 9        Anil  2021-09-07    Pass           240.0

Re: comments, to group by multiple columns, e.g. StudentClass and StudentName, use a list as the grouper:

...groupby([df['StudentClass'], df['StudentName']]).ffill()

Details

  1. Convert to_datetime:

    df['ExamDate'] = pd.to_datetime(df['ExamDate'])
    
  2. Use Series.where to generate each student's last failed date (here I've made it a column for easier visualization):

    df['LastFailedDate'] = df['ExamDate'].where(df['Result'] == 'Fail')
    
    #   StudentName    ExamDate  Result  LastFailedDate
    # 0        Anil  2021-01-10    Fail      2021-01-10
    # 1        Ramu  2021-01-20    Pass             NaT
    # 2        Ramu  2021-02-22    Fail      2021-02-22
    # 3        Anil  2021-03-30    Pass             NaT
    # 4       Peter  2021-01-04    Pass             NaT
    # 5       Peter  2021-06-06    Pass             NaT
    # 6        Anil  2021-04-30    Pass             NaT
    # 7        Ramu  2021-07-30    Pass             NaT
    # 8       Peter  2021-07-08    Fail      2021-07-08
    # 9        Anil  2021-09-07    Pass             NaT
    
  3. Use groupby.ffill to forward-fill the last failed date for each student (NaT if no previous failed exam):

    df['LastFailedDate'] = df['LastFailedDate'].groupby(df['StudentName']).ffill()
    
    #   StudentName    ExamDate  Result  LastFailedDate
    # 0        Anil  2021-01-10    Fail      2021-01-10
    # 1        Ramu  2021-01-20    Pass             NaT
    # 2        Ramu  2021-02-22    Fail      2021-02-22
    # 3        Anil  2021-03-30    Pass      2021-01-10
    # 4       Peter  2021-01-04    Pass             NaT
    # 5       Peter  2021-06-06    Pass             NaT
    # 6        Anil  2021-04-30    Pass      2021-01-10
    # 7        Ramu  2021-07-30    Pass      2021-02-22
    # 8       Peter  2021-07-08    Fail      2021-07-08
    # 9        Anil  2021-09-07    Pass      2021-01-10
    
  4. Finally subtract the exam dates by the last failed dates and use dt.days to extract the number of days:

    df['LastFailedDays'] = df['ExamDate'].sub(df['LastFailedDate']).dt.days.fillna(0)
    
    #   StudentName    ExamDate  Result  LastFailedDate  LastFailedDays
    # 0        Anil  2021-01-10    Fail      2021-01-10             0.0
    # 1        Ramu  2021-01-20    Pass             NaT             0.0
    # 2        Ramu  2021-02-22    Fail      2021-02-22             0.0
    # 3        Anil  2021-03-30    Pass      2021-01-10            79.0
    # 4       Peter  2021-01-04    Pass             NaT             0.0
    # 5       Peter  2021-06-06    Pass             NaT             0.0
    # 6        Anil  2021-04-30    Pass      2021-01-10           110.0
    # 7        Ramu  2021-07-30    Pass      2021-02-22           158.0
    # 8       Peter  2021-07-08    Fail      2021-07-08             0.0
    # 9        Anil  2021-09-07    Pass      2021-01-10           240.0
    
tdy
  • 36,675
  • 19
  • 86
  • 83
  • 1
    In step 3, using `groupby.ffill` is it possible to groupby more than one column ? Something like this, `df['LastFailedDate'] = df['LastFailedDate'].groupby(df[['StudentClass', 'StudentName']]).ffill()` It is throwing error, so I tried `df['LastFailedDate'] = df.groupby(['StudentClass', 'StudentName'])['LastFailedDate'].ffill()` But now I am wondering how to include it in pythonic oneline code you have written on top. – Amey Kumar Samala Nov 23 '21 at 05:46
  • @AmeyYadav you can pass a `list` of 2 series as the grouper: `groupby([df.StudentClass, df.StudentName]).ffill()` – tdy Nov 23 '21 at 06:29
  • 1
    so the oneliner would be: `df['LastFailedDays'] = df.ExamDate.where(df.Result == 'Fail').groupby([df.StudentClass, df.StudentName]).ffill().pipe(lambda fail_date: df.ExamDate.sub(fail_date).dt.days.fillna(0))` – tdy Nov 23 '21 at 06:34
1

I've finally come up with a solution that works.

# Process the data a bit
df['Tmp_Result'] = df['Result'].map({'Pass': 1, 'Fail': 0})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])

# Create a mask that will be used to group the rows by StudentName + consecutive passed tests after a failed test (including the failed test)
sorted_df = df.sort_values(['StudentName', 'ExamDate']) 
mask = sorted_df.groupby('StudentName')['Tmp_Result'].diff().ne(0).cumsum()
mask[(sorted_df['Tmp_Result'].eq(0) & ~(pd.isna(sorted_df.groupby('StudentName')['Tmp_Result'].shift(-1))))] += 1

df['LastFailedDays'] = df.groupby(mask)['ExamDate'].diff().fillna(pd.Timedelta(0))
df['LastFailedDays'] = df.groupby(mask)['LastFailedDays'].cumsum()

# Cleanup
df = df.drop('Tmp_Result', axis=1)

Output:

>>> df
  StudentName   ExamDate Result LastFailedDays
0        Anil 2021-01-10   Fail         0 days
1        Ramu 2021-01-20   Pass         0 days
2        Ramu 2021-02-22   Fail         0 days
3        Anil 2021-03-30   Pass        79 days
4       Peter 2021-01-04   Pass         0 days
5       Peter 2021-06-06   Pass       153 days
6        Anil 2021-04-30   Pass       110 days
7        Ramu 2021-07-30   Pass       158 days
8       Peter 2021-07-08   Fail         0 days
9        Anil 2021-09-07   Pass       240 days

>>> df.sort_values(['StudentName', 'ExamDate'])
  StudentName   ExamDate Result LastFailedDays
0        Anil 2021-01-10   Fail         0 days
3        Anil 2021-03-30   Pass        79 days
6        Anil 2021-04-30   Pass       110 days
9        Anil 2021-09-07   Pass       240 days
4       Peter 2021-01-04   Pass         0 days
5       Peter 2021-06-06   Pass       153 days
8       Peter 2021-07-08   Fail         0 days
1        Ramu 2021-01-20   Pass         0 days
2        Ramu 2021-02-22   Fail         0 days
7        Ramu 2021-07-30   Pass       158 days

It's a bit gruesome to the eyes, but because it's vectorized, it should be a lot faster than any solution using loops.