Is there a way to use cumsum with a threshold to create bins?

Question

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter. The intention is to form groupby based on the categories created.

  amount       price
0   27   22.372505
1   17  126.562276
2   33  101.061767
3   78  152.076373
4   15  103.482099
5   96   41.662766
6  108   98.460743
7  143  126.125865
8   82   87.749286
9   70   56.065133

The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:

 sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
    df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)

The use-case is to find the average price of every 75 sold amounts of the thing.

can you show the expected output, not very clear from your question — StupidWolf, Dec 17 '20 at 00:48
You're right - there were 2 different questions. I wanted to average the price every 75 sold amounts, but then I realized this may involve splitting threshold rows into two, and decided to instead simplify it by creating a group when amount goes over 75. Thanks for being thorough! — v1z3, Dec 17 '20 at 22:04

David Erickson · Accepted Answer · 2020-12-17T03:53:03.727

Solution #1 Interpreting the following one way will get my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of pd.cut, then here is a solution that will work well but the speed / memory will depend on the cumsum() of the amount column, which you can find out if you do df['amount'].cumsum(). The output will take about 1 second per every 10 million of the cumsum, as that is how many rows is created with np.repeat. Again, this solution is not horrible if you have less than ~10 million in cumsum (1 second) or even 100 million in cumsum (~10 seconds):

i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]: 
0-75        78.513748
75-150     150.715984
150-225     61.387540
225-300     67.411182
300-375     98.829611
375-450    126.125865
450-525    122.032363
525-600     87.326831
600-675     56.065133
Name: price, dtype: float64

Solution #2 (I believe this is wrong but keeping just in case) I do not believe you are tying to do it this way, which was my initial solution, but I will keep it here in case, as you haven't included expected output. You can create a new series with cumsum and then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change to a sting:

df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]: 
     Group       price
0     0-75   74.467390
1   75-150  101.061767
2  150-225  127.779236
3  225-300   41.662766
4  300-375   98.460743
5  375-450         NaN
6  450-525  126.125865
7  525-600   87.749286

You're right - there were 2 different questions. I wanted to average the price every 75 sold amounts, but then I realized this may involve splitting threshold rows into two, and decided to instead simplify it by creating a group when amount goes over 75. Thanks for being thorough! — v1z3, Dec 17 '20 at 18:40

Is there a way to use cumsum with a threshold to create bins?

1 Answers1