Cumulative count between two columns in a pandas df

Question

I am trying to return a cumulative count of the amount of times there's a change of values in a column.

So for the df below, I want to return a running count of the times that 'Home' changes to 'Away' and vice versa. I don't want to return the amount of times a value is displayed.

import pandas as pd

d = ({
    'Who' : ['Home','Away','','','Home','Away','Home','Home','Home','','Away','Home'],
    })

df = pd.DataFrame(data = d)

I tried this.

df['Home_count'] = (df['Who'] == 'Home').cumsum()
df['Away_count'] = (df['Who'] == 'Away').cumsum()

Which returns:

     Who  Home_count  Away_count
0   Home           1           0
1   Away           1           1
2                  1           1
3                  1           1
4   Home           2           1
5   Away           2           2
6   Home           3           2
7   Home           4           2
8   Home           5           2
9                  5           2
10  Away           5           3
11  Home           6           3

But I'm trying to count the amount of times it changes. Not the total count of each value. So if it reads Home, Home, Home, Away, there should only be a count next to Away. Not 1,2,3 against Home.

Home 1 #Theres a change so provide a count
Home   #No change so no count
Home   #No change so no count
Away 1 #Theres a change so provide a count
Home 2 #Theres a change so provide a count

Please refer to intended output:

Intended output:

   Count_Away Count_Home   Who
0                      1  Home
1           1             Away
2                             
3                             
4                      2  Home
5           2             Away
6                      3  Home
7                         Home
8                         Home
9                             
10          3             Away
11                     4  Home

Are you just looking for the total counts(no of time each of the word(home,Away) is changed) — min2bro, Jun 26 '18 at 02:39
No. I'll update the question sorry. I'd like it be to on the row were it changes — , Jun 26 '18 at 02:44
Why doesn't the first home game start at 1, but the First Away game does? — ALollz, Jun 26 '18 at 02:51
Also if there is a blank line between two Away (or home) rows, do you still want it to increment or not? — ALollz, Jun 26 '18 at 03:02
@ALollz because I only want to count the times it changes from Home to Away and vice versa. I can update this later though so I've altered the question for ease of understanding. — , Jun 26 '18 at 03:19
Not too worried about the incremental display. I would like it as per my intended output. But I can always drop duplicates. — , Jun 26 '18 at 03:20

cs95 · Accepted Answer · 2018-06-26T03:40:13.710

1

Use pd.get_dummies to get a DataFrame of one-hot encodings
Compute the cumulative sum with cumsum
Find the points of change using by comparing v with its shifted version
Fill NaNs with the empty string
Concatenate the result with the original

v = pd.get_dummies(
      df.where(df.Who.ne(df.Who.shift()) & df.Who.str.len().astype(bool)
   ), prefix='Count'
).cumsum()

df = pd.concat([
     v.where(v.ne(v.shift())).fillna('', downcast='infer'), df
  ], axis=1
)

print(df)
   Count_Away Count_Home   Who
0           0          1  Home
1           1             Away
2                             
3                             
4                      2  Home
5           2             Away
6                      3  Home
7                         Home
8                         Home
9                             
10          3             Away
11                     4  Home

edited Jun 26 '18 at 03:40

answered Jun 26 '18 at 03:28

cs95

379,657
97
704
746

I've updated the question. Hopefully it's clearer. 3,4,5 next to Home should be 3,' ', ' ', until the 11th index, which should be 4. – Jun 26 '18 at 03:31
@Punter345 It's not clear, and I don't understand why it should be 3, '', ''. Sorry. – cs95 Jun 26 '18 at 03:33
I've got a `cum count` of both Home and Away at the moment. But I want to count the amount of times it changes between Home and Away. So if it goes Home, Home, Home, there is no change there. – Jun 26 '18 at 03:35
@Punter345 Okay, I think I get it. Try it now. – cs95 Jun 26 '18 at 03:37
yep. Brilliant. Thankyou! – Jun 26 '18 at 03:38
@Punter345 Keep in mind that iteration should always be the last resort. I suggest you test this answer as well as the current accepted answer and figure out what works best for you and accept that. – cs95 Jun 26 '18 at 03:39
1

This is nice. Though might increment the count for `Home` blank `Home`. Always good to see how to do it without a loop. I guess mine could be done by a groupby, but yeah, not sure that's an improvement. – ALollz Jun 26 '18 at 03:49

score 0 · Answer 2 · answered Jun 26 '18 at 02:59

Shows the count of each of the words Home & Away every time it changes it's value in both the columns.

import pandas as pd

d = ({
    'Who' : ['Home','Away','','','Home','Away','Home','Home','Home','','Away','Home'],
    })
df = pd.DataFrame(data = d)



countaway=0
counthome=0
df['Count_Away']=0
df['Count_Home']=0

for index,rows in df.iterrows():    
    if(rows['Who']=='Home'):
        df['Count_Home'].values[index]=counthome+1
        counthome+=1
    else:
        df['Count_Home'].values[index]=0 
    if(rows['Who']=='Away'):
        df['Count_Away'].values[index]=countaway+1
        countaway+=1
    else:
        df['Count_Away'].values[index]=0

Output:

   Who  Count_Away  Count_Home
0   Home    0         1
1   Away    1         0
2           0         0
3           0         0
4   Home    0         2
5   Away    2         0
6   Home    0         3
7   Home    0         4
8   Home    0         5
9           0         0 
10  Away    3         0
11  Home    0         6

In your sample output in the Question the first home starts with 0 and Away starts with 1, which looks like a discrepancy. — min2bro, Jun 26 '18 at 03:03
[`df.iterrows()`](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues) is generally regarded as somewhat of a last resort. You can do this with `df['Who'].replace('', np.nan).ffil()`, then mask based on testing the first differences and take the cumulative sums. — Brad Solomon, Jun 26 '18 at 03:14
@min2bro I've updated the question for an easier understanding. I'm currently returning what you've put forward. I need to provide a count of when the values change. Not the total count of values. So if it goes Home, Home, Home, it shouldn't be counted as 3 times. — , Jun 26 '18 at 03:23

score 0 · Answer 3 · answered Jun 26 '18 at 03:35

Here's a method which will only count changes when it changes from Home to Away and vice-versa. It won't increment if there is a blank Who between two of the same types.

import pandas as pd
import numpy as np

whos = ['Home', 'Away']
for who in whos:
    # Find where `Who` is not consecutive based on index. Don't consider blank gaps
    # when determining changes. 
    s = df[df.replace('', np.NaN).fillna(method='ffill').Who==who].index.to_series().diff()!=1

    # Get the counts, align to original df based on index.
    df['Count_'+who] = s[s].cumsum()

    # Replace NaN with empty string to match your output
    df['Count_'+who] = df['Count_'+who].replace(np.NaN, '')

Output:

     Who Count_Home Count_Away
0   Home          1           
1   Away                     1
2                             
3                             
4   Home          2           
5   Away                     2
6   Home          3           
7   Home                      
8   Home                      
9                             
10  Away                     3
11  Home          4           
12                            
13  Home

Cumulative count between two columns in a pandas df

3 Answers3