4

I have a dataframe similar to the following:

import pandas as pd
data = {"Name":["Andrew","Andrew","Andrew","Andrew","Andrew","Andrew","Andrew", "Sam", "Sam", "Sam", "Sam", "Sam"], "PASS":[0, 1, 1, 0, 1, 1, 1, 0, 1, 1,0,1]}
df = pd.DataFrame(data=data)

Output

    Name    PASS
0   Andrew  0
1   Andrew  1
2   Andrew  1
3   Andrew  0
4   Andrew  1
5   Andrew  1
6   Andrew  1
7   Sam     0
8   Sam     1
9   Sam     1
10  Sam     0
11  Sam     1

I want to generate a dataframe which contains the largest consecutive passes for each student:

    Name    MAX_PASS
0   Andrew  3
1   Sam     2

I need a little help modifying the code that I have so far. The count is outputting 0110110110 and the result = 2. Which is not quite correct. I think I'm close but need some help to get over the finish line. Thanks.

count = ''
for i in range(len(df)-1):
    if df.Name[i] == df.Name[i+1]:
        if df.PASS[i] == 0:
            count += "0"
        else:
            count += "1"  
            result = len(max(count.split('0')))
rpanai
  • 12,515
  • 2
  • 42
  • 64
AndronikMk
  • 151
  • 1
  • 10
  • 2
    Use groupby, agg `df.groupby('Name').agg('sum')` . Documentation here for you https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html – wwnde Jan 06 '22 at 22:33

2 Answers2

2

You can consider to adapt this answer

def max_strike_group(x, col):
    x = x[col]
    a = x != 0
    out = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return out.max()

df.groupby("Name").apply(lambda x:max_strike_group(x, "PASS"))
Name
Andrew    3
Sam       2
dtype: int64
rpanai
  • 12,515
  • 2
  • 42
  • 64
2

One option is to call on cumsum twice, the first time is to add up the 0s and 1s, the second time is to get the values after the reset:

TL-DR:

cum1 = df.groupby('Name').PASS.cumsum()
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64

Explanation:

# first cumulative sum
cum1 = df.groupby('Name').PASS.cumsum()
cum1
0     0
1     1
2     2
3     2
4     3
5     4
6     5
7     0
8     1
9     2
10    2
11    3
Name: PASS, dtype: int64
# look for rows where the reset should occur
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
cum1
array([ 0,  1,  1, -2,  1,  1,  1,  0,  1,  1, -2,  1])
# build the max_pass column
# with the second cumsum and groupby
# before grouping again to get the max
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64
sammywemmy
  • 27,093
  • 4
  • 17
  • 31