
I have a table with 22 billion rows and was asked to compute a 30-day trailing sum on certain metrics, partitioned by day, then by product. The answers I have found online recommend a self-join to compute this trailing sum, but on a table this large I am extremely hesitant to even test that approach, purely from a cost perspective.

Is there a way to calculate a trailing sum without doing an extremely costly self-join?

  • After the first run, you will most likely have a faster way by adding the new day's rows and removing the ones that are too old (shifting the result by one day, not recomputing all 22 billion rows). – B. Go Dec 05 '19 at 22:23
  • Sorry, I should have been more clear. I was asked to make an aggregation of a metric, divided up every single day by every single product group, with each total in that day being the trailing sum of the last thirty days. – Alex Tseng Dec 05 '19 at 22:34
  • If approximate results are fine, using HyperLogLog does this at an awesome speed https://stackoverflow.com/a/54816697/132438 – Felipe Hoffa Dec 05 '19 at 22:41
  • See https://stackoverflow.com/a/35471601/864696 – Ross Presser Dec 05 '19 at 22:45
  • Is your table partitioned? It will probably make sense to do it in batches/chunks to get the historical metrics, perhaps yearly or monthly. Once you are "caught up", calculating each new day's metric should be pretty easy/small. – rtenha Dec 05 '19 at 22:46
  • @RossPresser Yeah, that's very close to what I need, thank you. My last problem with that is that I was already partitioning on a day granularity, but I want the sum of the last thirty days, so I'm unsure how to "reach over" the partitions I've already defined, so to speak. – Alex Tseng Dec 05 '19 at 23:25
  • Ok, I figured out a solution. Use the OVER clause in BigQuery's analytic functions to reduce the range the self-join runs over, then store the trailing results in their own table in the pipeline. THEN work on it with the request-specific partitions to get what I need. – Alex Tseng Dec 05 '19 at 23:44
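The incremental idea raised in the first comment (shift the stored result by one day rather than recomputing it) can be sketched in plain Python. This is not from the question itself; the function name and the `(day, total)` pair format are illustrative:

```python
from collections import deque

def trailing_30_day_sums(daily_totals):
    """Given (day_number, total) pairs in day order for one product,
    yield (day_number, trailing_30_sum) incrementally: each step adds
    the new day and evicts days that fell out of the 30-day window."""
    window = deque()   # (day, total) pairs currently inside the window
    running = 0
    for day, total in daily_totals:
        window.append((day, total))
        running += total
        # Drop days more than 29 days older than the current day.
        while window and window[0][0] < day - 29:
            running -= window[0][1]
            window.popleft()
        yield day, running

# Gaps are handled by the eviction test, not by row position:
print(list(trailing_30_day_sums([(1, 10), (2, 20), (40, 30)])))
# → [(1, 10), (2, 30), (40, 30)]
```

Each day's rows are processed once, so once the history is materialized, appending a new day is cheap compared with re-running the full aggregation.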

1 Answer


You would use window functions. If you have one row per product per date:

select t.*,
       sum(val) over (partition by product
                      order by date
                      rows between 29 preceding and current row
                     ) as sum_30
from t;
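As a quick sanity check (not part of the original answer), the same ROWS-based frame can be exercised against SQLite, whose window-function syntax matches here; the table and data are toy values:

```python
import sqlite3

# In-memory toy table: one row per product per date, no gaps.
# Requires SQLite >= 3.25 for window functions (bundled with Python 3.7+).
conn = sqlite3.connect(":memory:")
conn.execute("create table t (product text, date text, val integer)")
rows = [("a", f"2019-12-{d:02d}", d) for d in range(1, 6)]
conn.executemany("insert into t values (?, ?, ?)", rows)

# Trailing sum over the current row and the 29 preceding rows,
# per product, ordered by date -- mirrors the query above.
result = conn.execute("""
    select product, date,
           sum(val) over (partition by product
                          order by date
                          rows between 29 preceding and current row
                         ) as sum_30
    from t
    order by date
""").fetchall()
print([r[2] for r in result])
# → [1, 3, 6, 10, 15]  (with only 5 rows, each frame is the running total)
```

Note that a ROWS frame counts physical rows, so it only equals a 30-day sum when every product has exactly one row per calendar day.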

If you want 30 days and there might be gaps, then you need to convert the dates to a number, for which you can use unix_date(). That would be:

select t.*,
       sum(val) over (partition by product
                      order by unix_date(date)
                      range between 29 preceding and current row
                     ) as sum_30
from t;
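The gap-handling behavior of a RANGE frame can also be demonstrated in SQLite. `unix_date()` is BigQuery-specific, so a plain integer day number stands in for it here; the data is a toy example with a deliberate gap:

```python
import sqlite3

# Toy table with a gap: days 1, 2, and 40 for one product.
# RANGE frames with numeric offsets require SQLite >= 3.28.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (product text, day_num integer, val integer)")
conn.executemany("insert into t values (?, ?, ?)",
                 [("a", 1, 10), ("a", 2, 20), ("a", 40, 30)])

# RANGE frames measure distance in the ORDER BY value, not row position,
# so the row at day 40 excludes days 1 and 2 (more than 29 days back).
result = conn.execute("""
    select product, day_num,
           sum(val) over (partition by product
                          order by day_num
                          range between 29 preceding and current row
                         ) as sum_30
    from t
    order by day_num
""").fetchall()
print([r[2] for r in result])
# → [10, 30, 30]  (day 40's frame contains only day 40)
```

A ROWS frame over the same data would have summed all three rows at day 40, which is why the date-to-number conversion matters when days can be missing.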
Gordon Linoff