4

I have the following dataframe:

Disease     HeartRate   State    MonthStart   MonthEnd    
Covid       89          Texas    2020-02-28   2020-03-31      
Covid       91          Texas    2020-03-31   2020-04-30     
Covid       87          Texas    2020-07-31   2020-08-30      
Cancer      90          Texas    2020-02-28   2020-03-31 
Cancer      88          Florida  2020-03-31   2020-04-30      
Covid       89          Florida  2020-02-28   2020-03-31      
Covid       87          Florida  2020-03-31   2020-04-30      
Flu         90          Florida  2020-02-28   2020-03-31        

I have to subtract ‘previous row’ from the ‘current row’ in the ‘Heart’ column and create a new one.

However, there are some conditions:

  1. Row values will be subtracted only when the 'Disease' and 'State' columns have the same values.
  2. Row values will be subtracted only when the rows are in consecutive month. If there is a break in timeline, values won't be subtracted.
  3. If there is no previous row values to subtract, then put the 'HeartRate' value only.

Desired output:

Disease     HeartRate   State    MonthStart   MonthEnd     HeartRateDiff
Covid       89          Texas    2020-02-28   2020-03-31    89      
Covid       91          Texas    2020-03-31   2020-04-30    2     
Covid       87          Texas    2020-07-31   2020-08-30    87      
Cancer      90          Texas    2020-02-28   2020-03-31    90 
Cancer      88          Florida  2020-03-31   2020-04-30    88          
Covid       89          Florida  2020-02-28   2020-03-31    89      
Covid       87          Florida  2020-03-31   2020-04-30    -2      
Flu         90          Florida  2020-02-28   2020-03-31    90      

I know how to subtract previous row from the current row using the following code:

df[‘DiffHeartRate’] = df.groupby(['Disease', 'State'])['HeartRate'].transform(pd.Series.diff)

However, I’m facing two problems:

  1. Keeping the same value if there is no previous row to subtract.
  2. Checking the continuity of timeline (next month or not).

Is there a smarter way of doing it? Any help would be appreciated. Thanks!

Roy
  • 924
  • 1
  • 6
  • 17
  • this is a strange operation. Also, your dates look out of order – anon01 May 07 '21 at 05:20
  • Have you considered the case of year break ? Add 2 lines to your sample data: `Covid 92 Texas 2020-12-31 2020-01-31` and `Covid 93 Texas 2021-01-31 2021-02-28 `. These 2 rows are in consecutive months and should show `HearRateDiff` of `1`. You can see which solutions give the correct result then. – SeaBean May 08 '21 at 20:27
  • Actually, solutions that check for consecutive months by using the 2 month numbers (e.g. 1 for January and 12 for December) differ by 1 would simply fail to handle year break (month number going from 12 to 1) and give wrong result. – SeaBean May 08 '21 at 21:03
  • That's a nice observation @SeaBean. Thank you for informing. Much appreciated. However, would you please take a look at this problem: https://stackoverflow.com/questions/67453646/create-new-dataframe-using-multiple-conditions-across-different-timeline-and-loc – Roy May 09 '21 at 02:03

4 Answers4

2

I've used a combination of groupby and np.where and df.fillna() to accomplish your tasks.

There may be more efficient methods but I hope this helps.

Input the df

Disease HeartRate   State   MonthStart  MonthEnd
0   Covid   89  Texas   2020-02-28  2020-03-31
1   Covid   91  Texas   2020-03-31  2020-04-30
2   Covid   87  Texas   2020-07-31  2020-08-30
3   Cancer  90  Texas   2020-02-28  2020-03-31
4   Cancer  88  Florida 2020-03-31  2020-04-30
5   Covid   89  Florida 2020-02-28  2020-03-31
6   Covid   87  Florida 2020-03-31  2020-04-30
7   Flu 90  Florida 2020-02-28  2020-03-31

Get HeartRateDiff just like you did

df['DiffHeartRate'] = df.groupby(['Disease', 'State'])['HeartRate'].transform(pd.Series.diff)

For the consecutive months, I would add previous month value as a column

Then simply check whether the months are consecutive or not using np.where

df['MonthStart'] = pd.to_datetime(df['MonthStart'])
df['PrevMonth'] = df['MonthStart'].shift().dt.month
df['DiffHeartRateFinal'] = np.where(df['PrevMonth']==df['MonthStart'].dt.month-1, df['DiffHeartRate'], df['HeartRate'])

Finally, fill all NAN with HeartRate instead

df['DiffHeartRateFinal'] = df['DiffHeartRateFinal'].fillna(df['HeartRate'])

Output

Disease HeartRate   State   MonthStart  MonthEnd    DiffHeartRateFinal
Covid   89  Texas   2020-02-28  2020-03-31  89.0
Covid   91  Texas   2020-03-31  2020-04-30  2.0
Covid   87  Texas   2020-07-31  2020-08-30  87.0
Cancer  90  Texas   2020-02-28  2020-03-31  90.0
Cancer  88  Florida 2020-03-31  2020-04-30  88.0
Covid   89  Florida 2020-02-28  2020-03-31  89.0
Covid   87  Florida 2020-03-31  2020-04-30  -2.0
Flu     90  Florida 2020-02-28  2020-03-31  90.0
Shubham Periwal
  • 2,198
  • 2
  • 8
  • 26
2

Try:

import numpy as np

df.MonthStart = pd.to_datetime(df.MonthStart)
df.MonthEnd = pd.to_datetime(df.MonthEnd)


def cal_diff(x):
    x['DiffHeartRate'] = np.where(x['MonthEnd'].shift().dt.month.eq(
        x['MonthStart'].dt.month), x['HeartRate'].diff(), x['HeartRate'])
    return x


df = df.groupby(['Disease', 'State']).apply(cal_diff)

Output

  Disease  HeartRate    State MonthStart   MonthEnd DiffHeartRate
0   Covid         89    Texas 2020-02-28 2020-03-31            89
1   Covid         91    Texas 2020-03-31 2020-04-30             2
2   Covid         87    Texas 2020-07-31 2020-08-30            87
3  Cancer         90    Texas 2020-02-28 2020-03-31            90
4  Cancer         88  Florida 2020-03-31 2020-04-30            88
5   Covid         89  Florida 2020-02-28 2020-03-31            89
6   Covid         87  Florida 2020-03-31 2020-04-30            -2
7     Flu         90  Florida 2020-02-28 2020-03-31            90
Nk03
  • 14,699
  • 2
  • 8
  • 22
2

You may try something like this:

df['DiffHeartRate']=(df.groupby(['Disease', 'State', 
          (df.MonthStart.dt.month.ne(df.MonthStart.dt.month.shift()+1)).cumsum()])['HeartRate']
 .apply(lambda x: x.diff())).fillna(df.HeartRate)

    Disease HeartRate   State   MonthStart  MonthEnd    DiffHeartRate
0   Covid   89          Texas   2020-02-28  2020-03-31  89.0
1   Covid   91          Texas   2020-03-31  2020-04-30  2.0
2   Covid   87          Texas   2020-07-31  2020-08-30  87.0
3   Cancer  90          Texas   2020-02-28  2020-03-31  90.0
4   Cancer  88          Florida 2020-03-31  2020-04-30  88.0
5   Covid   89          Florida 2020-02-28  2020-03-31  89.0
6   Covid   87          Florida 2020-03-31  2020-04-30  -2.0
7   Flu     90          Florida 2020-02-28  2020-03-31  90.0

Logic is same as the other answers but different way of representing.

Pygirl
  • 12,969
  • 5
  • 30
  • 43
  • Thank you so much, @Pygirl. The 'ne' idea is great. Learned that. Also, fillna() option gives me opportunity to try difference values. Appreciate! – Roy May 09 '21 at 01:57
  • Hi @Pygirl. How would you approach for this problem: https://stackoverflow.com/questions/67453646/create-new-dataframe-using-multiple-conditions-across-different-timeline-and-loc – Roy May 09 '21 at 02:52
2

You can do it by .mask() together with .groupby() and .transform() as follows:

df['HeartRateDiff'] = (df['HeartRate'].mask(
                           df['MonthStart'].groupby([df['Disease'], df['State']]).transform('diff').lt(np.timedelta64(2,'M')),
                           df.groupby(['Disease', 'State'])['HeartRate'].transform('diff')
                           )
                      )

Details:

(1) Firstly, we ensure the date columns are of datetime format instead of strings:

You can skip this step if your date columns are already in datetime format.

df['MonthStart'] = pd.to_datetime(df['MonthStart'])
df['MonthEnd'] = pd.to_datetime(df['MonthEnd'])

(2) The HeartRate change (within group) is obtained by:

df.groupby(['Disease', 'State'])['HeartRate'].transform('diff')

We can simply use 'diff' within .transform() instead of using pd.Series.diff to achieve the same result.

(3) Continuity of timeline (next month or not) is checked by the following condition:

df['MonthStart'].groupby([df['Disease'], df['State']]).transform('diff').lt(np.timedelta64(2,'M'))

We check the time difference with previous date (within group) being strictly less than 2 months to ensure it is in the next month. We cannot check <= 1 month since some date difference of 2 consecutive month begins can be 32 days. Note that this checking also works for year break (from December to January) where logics checking only with month figure (from 12 to 1) will give wrong result.

(4) Finally, we get the new column by using .mask() on the existing column HeartRate:

.mask() tests for the condition in its 1st parameter and replaces rows to values in its 2nd parameter when the condition is true. It retains the original values for rows when the condition is not met. Thus, achieving our goal of conditional replacement of values.

Output:

  Disease  HeartRate    State MonthStart   MonthEnd  HeartRateDiff
0   Covid         89    Texas 2020-02-28 2020-03-31             89
1   Covid         91    Texas 2020-03-31 2020-04-30              2
2   Covid         87    Texas 2020-07-31 2020-08-30             87
3  Cancer         90    Texas 2020-02-28 2020-03-31             90
4  Cancer         88  Florida 2020-03-31 2020-04-30             88
5   Covid         89  Florida 2020-02-28 2020-03-31             89
6   Covid         87  Florida 2020-03-31 2020-04-30             -2
7     Flu         90  Florida 2020-02-28 2020-03-31             90
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • It's so awesome @SeaBean. The way you explain makes me understand the problem better. Also, learned the use of 'lt'. – Roy May 09 '21 at 02:01
  • @Roy Pleased to help! I think you have to pick a solution that works for year break since it is quite certain that such situation will happen in your data. Would you reconsider the solution you accept as answer ? – SeaBean May 09 '21 at 05:13
  • Hi @Roy Have you finally reconsidered to use a solution that can handle year break ? It's very likely you will meet such case in your data. If yes, please choose again and pick a solution that work for you. I insist here since I don't want subsequent people with similar problem as you would pick a wrong solution by looking at the accepted solution. Thanks for your co-operation! – SeaBean May 12 '21 at 17:44