sum() until threshold value reached and summarize it as a single record and reset and continue the aggregation

Question

I am trying to write BigQuery SQL to apply sum() against a particular column until it reaches the threshold value, then a summarized record should be the output. The process will continue until all the records in the tables are processed and summarized, even if the sum value does not reach the threshold value.

Below is how my data will look like in the table

Expected output:

If the Threshold value is 10000000 then the SQL should give 2 rows as output for id=1. If we add the total_amount column of the first 2 records of id=1 in my raw data then it reaches the threshold value 10000000 so the output needs to be displayed as from_range as 1 and to_range as 20 and the total_amount as 10000000. Now the SQL needs to reset and do the same operation for the remaining rows of id=1 which comes to the total_amount of 210000 with from_range as 21 and to_range as 40.

But all I could achieve is to get cumulative running total_amount with the below query.

Select id, from_range, to_range, total_amount, SUM(total_amount) OVER (PARTITION BY id ORDER BY from_range asc ROWS UNBOUNDED PRECEDING) as aggregated_total_amount From Test

I've referred stack overflow questions(How to "reset" running SUM after it reaches a threshold?) related to running sum, but I couldn't solve myself for this requirement. Any guidance will be really helpful. Thanks.

score 2 · Accepted Answer · answered Jul 31 '22 at 00:33

Consider below approach

with recursive temp as (
  select *, row_number() over(partition by id order by from_range) pos
  from your_table 
), result as (
  select *, total_amount as total, true as new_group 
  from temp where pos = 1
  union all
  select t.*, 
    if(total + t.total_amount > 10000000, t.total_amount, total + t.total_amount), 
    if(total + t.total_amount > 10000000, true, false)
  from temp t join result r 
  on t.pos = r.pos + 1 and t.id = r.id
)
select id, 
  min(from_range) from_range, 
  max(to_range) to_range, 
  max(total) as total_amount
from (
  select *, countif(new_group) over(partition by id order by pos) grp 
  from result
)
group by id, grp

if applied to sample data in your question - output is

Thanks @Mikhail. Your approach works well for my requirement and extended bit further in my code and tested but considering the nature of the environment where am going to implement, I will go with your option. — Sriraam Venkataraman, Aug 10 '22 at 15:30

Jaytiger · Answer 2 · 2022-07-31T07:37:25.483

I think you can partition ranges not to be greater than threshold using an UDF like below:

CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  bin = 0;
  a.reduce((c, v) => {
    if (c + Number(v) > 10000000) { bin += 1; return Number(v); }
    else return c += Number(v); 
  }, 0);

  return bin;
""";

and then aggregate total_amount for each bin of id:

WITH binning AS (
  SELECT *, cumsumbin(ARRAY_AGG(total_amount) OVER (PARTITION BY id ORDER BY from_range)) bin 
    FROM sample_table
)
SELECT id, 
       MIN(from_range) from_range,
       MAX(to_range) to_range,
       SUM(total_amount) total_amount
  FROM binning
 GROUP BY id, bin;

Thanks @Jaytiger for your guidance. I tested this option also for my requirement by extending bit further but am going with Mikhail's option. — Sriraam Venkataraman, Aug 10 '22 at 15:32

sum() until threshold value reached and summarize it as a single record and reset and continue the aggregation

2 Answers2