4

I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meets several criteria.

Using the following example and code from stackoverflow 25119524

import pandas as pd


l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3))) 

cowmast.columns =['Cow', 'Lact', 'DIM']

def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count +=1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable

cowmast['xmast'] = cowmast['Cow'].apply(rolling_count) #new column in dataframe

cowmast

The output is xmast (number of times mastitis) for each cow


  Cow Lact DIM xmast
0   1   1   45  1
1   1   2   25  2
2   1   2   28  3
3   2   2   70  1
4   2   2   95  2
5   2   2   98  3
6   2   2   120 4
7   2   3   80  5

What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.

To incorporate more than one condition to reset the count for each cows lactation (Lact) I used the following code.


def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df

count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], ['Lxmast'])

That produces the following output


Cow Lact    DIM xmast   Lxmast
0   1   1   45  1   1
1   1   2   25  2   1
2   1   2   28  3   2
3   2   2   70  1   1
4   2   2   95  2   2
5   2   2   98  3   3
6   2   2   120 4   4
7   2   3   80  5   1

I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.

The output I am looking for is called "Adjusted" in the table below.


  Cow Lact DIM xmast Lxmast Adjusted
0   1   1   45  1   1   1
1   1   2   25  2   1   1
2   1   2   28  3   2   1
3   2   2   70  1   1   1
4   2   2   95  2   2   2
5   2   2   98  3   3   2
6   2   2   120 4   4   3
7   2   3   80  5   1   1

In the example above for cow 1 lact 2 the count is not incremented when the dim goes from 25 to 28 as the difference between the two events is less than 7 days. Same for cow 2 lact 2 when is goes from 95 to 98. For the larger increments 70 to 95 and 98 to 120 the count is increased.

Thank you for your help

John

SeaBean
  • 22,547
  • 3
  • 13
  • 25
JohnH
  • 253
  • 1
  • 11
  • Please include your expected output and try to rephrase your question, it is not clear what you want. – Frank Sep 19 '21 at 13:36
  • Thanks for the feedback Frank, hopefully my edits and example will make it clearer. John – JohnH Sep 19 '21 at 21:02

1 Answers1

5

Actually, your codes to set up xmast and Lxmast can be much simplified if you had used the solution with the highest upvotes in the referenced question.

Renaming your dataframe cowmast to df, you can set up xmast as follows:

df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1

Similarly, to set up Lxmast, you can use:

df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(), 
                            (df['Lact'] != df['Lact'].shift()).cumsum()])
                  .cumcount()+1
               )

Data Input

l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3))) 

cowmast.columns =['Cow', 'Lact', 'DIM']

df = cowmast

Output

print(df)

  Cow  Lact  DIM  xmast  Lxmast
0   1     1   45      1       1
1   1     2   25      2       1
2   1     2   28      3       2
3   2     2   70      1       1
4   2     2   95      2       2
5   2     2   98      3       3
6   2     2  120      4       4
7   2     3   80      5       1

Now, continue with the last part of your requirement highlighted in bold below:

What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.

we can do it as follows:

To make the codes more readable, let's define 2 grouping sequences for the codes we have so far:

m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()

Then, we can rewrite the codes to set up Lxmast in a more readable format, as follows:

df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1

Now, turn to the main works here. Let's say we create another new column Adjusted for it:

df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
                   ['DIM'].diff().abs().gt(7)
                   .groupby([m_Cow, m_Lact])
                   .cumsum()+1
                )

Result:

print(df)

  Cow  Lact  DIM  xmast  Lxmast  Adjusted
0   1     1   45      1       1         1
1   1     2   25      2       1         1
2   1     2   28      3       2         1
3   2     2   70      1       1         1
4   2     2   95      2       2         2
5   2     2   98      3       3         2
6   2     2  120      4       4         3
7   2     3   80      5       1         1

Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and check for each row's difference with previous row by .diff() and take the absolute value by .abs(), then check whether it is > 7 by .gt(7) in the code fragment ['DIM'].diff().abs().gt(7). We then group by the same grouping again .groupby([m_Cow, m_Lact]) since this 3rd condition is within the grouping of the first 2 conditions. The final step we use .cumsum() on the 3rd condition, so that only when the 3rd condition is true we increment the count.

Just in case you want to increment the count only when the DIM is inreased by > 7 only (e.g. 70 to 78) and exclude the case decreased by > 7 (not from 78 to 70), you can remove the .abs() part in the codes above:

df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
                   ['DIM'].diff().gt(7)
                   .groupby([m_Cow, m_Lact])
                   .cumsum()+1
                )

Edit (Possible simplification depending on your data sequence)

As your sample data have the main grouping keys Cow and Lact somewhat already in sorted sequence, there's opportunity for further simplification of the codes.

Different from the sample data from the referenced question, where:

   col count
0  B   1
1  B   2
2  A   1 # Value does not match previous row => reset counter to 1
3  A   2
4  A   3
5  B   1 # Value does not match previous row => reset counter to 1

Here, the last B in the last row is separated from other B's and it required the count be reset to 1 rather than continuing from the last count of 2 of the previous B (to become 3). Hence, the grouping needs to compare current row with previous row to get the correct grouping. Otherwise, when we use .groupby() and the values of B are grouped together during processing, the count value may not be correctly reset to 1 for the last entry.

If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted by instruction such as:

df = df.sort_values(['Cow', 'Lact'])

Then, we can simplify our codes, as follows:

(when data already sorted by [Cow, Lact]):

df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
               
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
                    ['DIM'].diff().abs().gt(7)
                    .groupby([df['Cow'], df['Lact']])
                    .cumsum()+1
                 )

Same result and output values in the 3 columns xmast, Lxmast and Adjusted

SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Thank you for your answer and for taking the time to explain how it works. I am new to python so the explanations are very helpful. Much appreciated. John – JohnH Sep 19 '21 at 23:22
  • @user16776897 John, glad the solution works for you. I'm pleased to help! – SeaBean Sep 19 '21 at 23:40
  • @user16776897 John, see my latest edit above if you want to learn more about when should we apply which kind of grouping codes. Here, your sample data could possibly be handled with even simpler codes if they are already in some sorted sequence. Just for your reference for learning more. – SeaBean Sep 20 '21 at 10:49
  • SeaBean Thank you for that it is nice to see how the code can be simplified. I took the original code back to the larger dataset and ran into a snag. I will edit the question above to illustrate – JohnH Sep 20 '21 at 11:48
  • @user16776897 Ok, let me look at it. Will come back in a while as I'm currently engaged in some tasks. – SeaBean Sep 20 '21 at 13:06
  • @user16776897 Hi John, I used your new data in the dict at the bottom of your edit. Got no `NaN` values, with the result. Would you please recheck whether there is any setup issue with your data ? – SeaBean Sep 20 '21 at 13:18
  • @user16776897 Edited my run result into your question (not my answer). Take a look. You can run yourself based on the data, which are simply what your provided, and see if any problem. If no, should be other issues, probably some copy or data set up issue during the run of your data. – SeaBean Sep 20 '21 at 13:28
  • @user16776897 Looking in details your codes with `newm`, it seems no problem on the codes. Maybe you can try replacing your code `df = newm` with `df = newm.copy()` or directly work on `newm` in the code instead of `df` (i.e. replace all `df` with `newn` in the codes above). – SeaBean Sep 20 '21 at 14:49
  • SeaBean, thanks for taking a look. Yes it also worked for me when I recreate the df. When I go back to the original file it gives me the NA values unless I dict out and then recreate the df. There appears to be some re-arrangement of the dataframe in the output and recreation process. Tried the df = newm.copy() and newm directly. I still get the NA. I will keep working on the date setup side to see if I can identify a fix. Thank you for all your help. John – JohnH Sep 20 '21 at 22:25
  • @user16776897 It's quite interesting of this situation. Anyway, if it also works for you when you recreate df, it's no problem related to question and solution here. I suggest to clean up the question by removing the latest edit. This can help future readers of similar question to read the question easier and avoid confusion. Good luck to fixing the root cause of data problem. – SeaBean Sep 20 '21 at 22:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237303/discussion-between-user16776897-and-seabean). – JohnH Sep 21 '21 at 00:10