How can I count the number of times an entry is a repeat of the previous entry within a column while grouping by another column in Python?

Question

Consider the following table for example:

import pandas as pd

data = {'Group':["AGroup", "AGroup", "AGroup", "AGroup", "BGroup", "BGroup", "BGroup", "BGroup", "CGroup", "CGroup", "CGroup", "CGroup"],
        'Status':["Low", "Low", "High", "High", "High", "Low", "High", "Low", "Low", "Low", "High", "High"],
        'CountByGroup':[1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2]}

pd.DataFrame(data)

This creates the following table:

Group   Status  CountByGroup
AGroup  Low     1
AGroup  Low     2
AGroup  High    1
AGroup  High    2
BGroup  High    1
BGroup  Low     1
BGroup  High    1
BGroup  Low     1
CGroup  Low     1
CGroup  Low     2
CGroup  High    1
CGroup  High    2

The CountByGroup column is what I am trying to create. Here you can see that "Low" appeared once so far for the "AGroup" in the first row, so it has an entry of 1. "Low" directly follows the same entry "Low" in the second row, so it has an entry of 2. If it were to appear a third time in a row in the third row, CountByGroup would display an entry of 3.

We're also grouping these "Group", so the first entry for a new group is always 1 since it is the first time any entry has appeared for the group.

This was solved in a previous question I had using R which is available here, but I'm not sure how to solve this using Python.

Do you have an instance where `Status` is `Low, Low, High, High, Low` within a group? If so, what's the expected output? — Quang Hoang, Mar 07 '23 at 21:17
Instances of Low, Low, High, High, Low could occur within AGroup for example. The CountByGroup entries would be 1, 2, 1, 2, 1. Additionally, a status could appear more than two times in a group (e.g., Low, Low, Low, High) in which case CountByGroup would be 1, 2, 3, 1. — GM01, Mar 07 '23 at 21:31

Corralien · Answer 1 · 2023-03-07T20:42:05.207

Assuming your dataframe is correctly sorted or use df.sort_values('Group', kind='stable') before:

cols = ['Group', 'Status']
df['CountByGroup2'] = df[cols].eq(df[cols].shift()).all(axis=1).astype(int).add(1)
print(df)

# Output
     Group Status  CountByGroup  CountByGroup2
0   AGroup    Low             1              1
1   AGroup    Low             2              2
2   AGroup   High             1              1
3   AGroup   High             2              2
4   BGroup   High             1              1
5   BGroup    Low             1              1
6   BGroup   High             1              1
7   BGroup    Low             1              1
8   CGroup    Low             1              1
9   CGroup    Low             2              2
10  CGroup   High             1              1
11  CGroup   High             2              2

This works for the example, but as mentioned a status can appear more than two consecutive times within a group (e.g., AGroup entries of Low, Low, Low, Low) in which case the CountByGroup entries for this group would be 1, 2, 3, 4. This solution appears to only count up to 2. — GM01, Mar 07 '23 at 21:34

Quang Hoang · Accepted Answer · 2023-03-07T21:44:03.400

Usually, you would use cumsum on the row differences to identify continuous blocks. Then you can group by the Group and the Block then cumcount:

# assuming data is sorted by Group as in the example
blocks = df['Status'].ne(df['Status'].shift()).cumsum()
df['CountByGroup'] = df.groupby(['Group', blocks]).cumcount() + 1

Note if the data is not sorted by Group, you would need to sort before creating the blocks:

blocks = df['Status'].ne(df.sort_values('Group', kind='stable')['Status'].shift()).cumsum()

or a groupby:

blocks = df['Status'].ne(df.groupby('Group')['Status'].shift()).cumsum()

Output:

     Group Status  CountByGroup
0   AGroup    Low             1
1   AGroup    Low             2
2   AGroup   High             1
3   AGroup   High             2
4   BGroup   High             1
5   BGroup    Low             1
6   BGroup   High             1
7   BGroup    Low             1
8   CGroup    Low             1
9   CGroup    Low             2
10  CGroup   High             1
11  CGroup   High             2

Laurent B. · Answer 3 · 2023-03-07T21:55:16.530

import pandas as pd

data = {'Group':["AGroup", "AGroup", "AGroup", "AGroup", "BGroup", "BGroup", "BGroup", "BGroup", "CGroup", "CGroup", "CGroup", "CGroup"],
        'Status':["Low", "Low", "High", "High", "High", "Low", "High", "Low", "Low", "Low", "High", "High"],
        'CountByGroup':[1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2]}

df = pd.DataFrame(data)

s = df[['Group', 'Status']].groupby(['Group', 'Status']).ngroup()
s = s.groupby(lambda x: s[x]).cumcount()+1

print(pd.concat([df[['Group', 'Status']], s.rename("CountByGroup")], axis=1))

      Group Status  CountByGroup
0   AGroup    Low             1
1   AGroup    Low             2
2   AGroup   High             1
3   AGroup   High             2
4   BGroup   High             1
5   BGroup    Low             1
6   BGroup   High             2
7   BGroup    Low             2
8   CGroup    Low             1
9   CGroup    Low             2
10  CGroup   High             1
11  CGroup   High             2

How can I count the number of times an entry is a repeat of the previous entry within a column while grouping by another column in Python?

3 Answers3