14

Any idea how to calculate running total in BigQuery SQL?

id   value   running total
--   -----   -------------
1    1       1
2    2       3
3    4       7
4    7       14
5    9       23
6    12      35
7    13      48
8    16      64
9    22      86
10   42      128
11   57      185
12   58      243
13   59      302
14   60      362 

Not a problem for traditional SQL servers using either correlated scalar query:

SELECT a.id, a.value, (SELECT SUM(b.value)
                       FROM RunTotalTestData b
                       WHERE b.id <= a.id)
FROM   RunTotalTestData a
ORDER BY a.id;

or join:

SELECT a.id, a.value, SUM(b.Value)
FROM   RunTotalTestData a,
       RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;

But I couldn't find a way to make it work in BigQuery...

Sasa
  • 301
  • 1
  • 2
  • 8

4 Answers4

37

2018 update: The query in the original question works without modification now.

#standardSQL
WITH RunTotalTestData AS (
  SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)]) 
)

SELECT a.id, a.value, (SELECT SUM(b.value)
                       FROM RunTotalTestData b
                       WHERE b.id <= a.id) runningTotal
FROM   RunTotalTestData a
ORDER BY a.id;

enter image description here

2013 update: You can use SUM() OVER() to calculate running totals.

In your example:

SELECT id, value, SUM(value) OVER(ORDER BY id)
FROM [your.table]

A working example:

SELECT word, word_count, SUM(word_count) OVER(ORDER BY word)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' LIMIT 30;
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • This could actually be very useful in financial monte carlo simulations where we need cumulative sums of small random deviations to simulate a price path. – Paul Mar 19 '15 at 05:21
  • This syntax works in standard SQL, too. See the documentation on analytic functions. https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions – Tim Swast Feb 10 '17 at 22:50
  • 1
    The 2018 update shows a `correlated subquery`, and these are generally awful in databases. The correlated subquery is effectively doing a triangular-self-join; every single row runs the calculation to sum all the preceding rows. I am not privy to the internals of Bigquery, and it is entirely possible that the query planner can transform this into something else like the window function `sum() over()` that should only require a single pass over the data, but if it doesn't, then allowing the original syntax in the 2018 update is a backward step. – Davos Jan 24 '19 at 13:57
4

You probably figured it out already. But here is one, not the most efficient, way:

JOIN can only be done using equality comparisons i.e. b.id <= a.id cannot be used.

https://developers.google.com/bigquery/docs/query-reference#joins

This is pretty lame if you ask me. But there is one work around. Just use equality comparison on some dummy value to get the cartesian product and then use WHERE for <=. This is crazily suboptimal. But if your tables are small this is going to work.

SELECT a.id, SUM(a.value) as rt 
FROM RunTotalTestData a 
JOIN RunTotalTestData b ON a.dummy = b.dummy 
WHERE b.id <= a.id 
GROUP BY a.id 
ORDER BY rt

You can manually constrain the time as well:

SELECT a.id, SUM(a.value) as rt 
FROM (
    SELECT id, timestamp RunTotalTestData 
    WHERE timestamp >= foo 
    AND timestamp < bar
) AS a 
JOIN (
    SELECT id, timestamp, value RunTotalTestData 
    WHERE timestamp >= foo AND timestamp < bar
) b ON a.dummy = b.dummy 
WHERE b.id <= a.id 
GROUP BY a.id 
ORDER BY rt

Update:

You don't need a special property. You can just use

SELECT 1 AS one

and join on that.

As billing goes the join table counts in the processing.

Emery Lapinski
  • 1,572
  • 18
  • 26
Alen Vrečko
  • 886
  • 10
  • 13
  • 1
    2013 update: you can use SUM() OVER() for a simpler query http://stackoverflow.com/questions/14664578/bigquery-sql-running-totals/20480827#20480827 – Felipe Hoffa Dec 09 '13 at 21:09
1

The problem is with the second query, that BigQuery will UNION the 2 tables in the FROM expression.

I'm not sure about the first one, but it's possible that bigquery doesn't like subselects at the Select expressions, only at the FromExpression. So you need to move the subquery into the fromexpression, and JOIN the results.

Also, you could give it a try to our JDBC driver: Starschema BigQuery JDBC Driver

Just simply load it into Squirrel SQL, or RazorSQL or kinda any tool that supports JDBC drivers, make sure you turn on the Query Transformer by setting:

transformQuery=true

In the properties or in the JDBC url, every info can be found at the project page. After you did this, try to run the 2nd query, it will be transformed into a BigQuery compatible join.

Balazs Gunics
  • 2,017
  • 2
  • 17
  • 24
  • Can you give the example of the BigQuery query? I cannot see how you can move the subselect from SELECT to FROM as it references B? And you cannot JOIN ON <=. What am I missing? – Alen Vrečko Feb 05 '13 at 23:46
  • thanks for hints - but no luck with BQ jdbc driver (and SquirrelSQL) - always getting error (sqlState: null, errorCode: 0). – Sasa Feb 06 '13 at 07:33
  • @AlenVrečko If you turn on the logging, with the level set on debug, it logs out the Parsed queries too. We made our parser to be compatible with Reporting tools, because of that we built up an ANTLR grammar from 0, and it only accepts subqueries at the FROM. [Examples can be found here](http://code.google.com/p/starschema-bigquery-jdbc/wiki/QueryTransformationEngine) – Balazs Gunics Feb 06 '13 at 09:34
  • @Sasa Thats bad, but as I read the accepted answer that was because of the <= in the WHERE, we leave the comparison operator as it is. – Balazs Gunics Feb 06 '13 at 09:34
1

It's easy if we are allow to use window function.

https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

With that we can do it like this :

WITH RunTotalTestData AS (
  SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)]) 
)

select *, sum(value) over(order by id) as running_total
from RunTotalTestData

enter image description here

pkd
  • 471
  • 1
  • 6
  • 17