2

I am interested in doing partial sums for the following dataframe:

    ID  Name    A   B
1   111 foo     248 123
2   222 bar     331 94
3   111 foo     266 102
4   111 foo     217 163
5   222 bar     194 102
6   222 bar     188 89

I could use groupby with either sum or agg, such as:

df = df.groupby(["ID", "Name"]).agg(sum).reset_index()

which yields:

    ID  Name    A   B
1   111 foo     731 388
2   222 bar     713 285

However, I would like to combine indices only until column A passes some pre-specified value, and then begin a second grouping. When that one passes the pre-specified value, begin a third grouping, and so on. For example, if the threshold is set at 500, the code would yield:

    ID  Name    A   B
1   111 foo     514 225
2   222 bar     525 196
3   111 foo     217 163
4   222 bar     188 89

Rows 1 and 3 in the original df were grouped. Rows 2 and 4 were grouped. Row 5 does not group with rows 1 and 3 because the threshold of 500 has been passed. Row 6 was similarly left ungrouped.

Order of rows does not matter. Which rows are combined with which other rows does not matter. I just need the ability to group column values with a threshold. I'm stumped, especially when trying to figure out a Pythonic solution as opposed to iterating through the dataframe row by row and explicitly evaluating each row. Any feedback would be much appreciated.

jda
  • 506
  • 1
  • 4
  • 15
  • 2
    I think looping is necessary for such a calculation. There's a lot of useful information here: https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value – ALollz Aug 17 '20 at 14:15
  • Very useful, not least because I didn't know about Numba. Many thanks! – jda Aug 17 '20 at 14:19

1 Answers1

2

You can do this with a custom function to pass to the apply function. First use cumsum to identify the group ends, create an extra column with the new group id, then perform another groupby on the new intermediate dataframe.

I've made threshold a parameter in the funtion.

def grouper(x,threshold=500):
    A = (x['A'].cumsum().values/threshold).astype(int)
    loc = (np.diff(A)!=0).nonzero()[0]+1
    A[loc] = A[loc]-1 
    x['C'] = A
    
    return x.groupby(['C'])['A','B'].sum().reset_index(drop=True)
    
    

df.groupby(["ID", "Name"]).apply(grouper,threshold=500)
EddyG
  • 675
  • 4
  • 13