22

I'm trying to calculate a running sum over a partition. This seems easier and quicker than the method suggested in BigQuery SQL running totals.

For example:

SELECT corpus,corpus_date,word_count, sum(word_count) over (partition by corpus,corpus_date order by word_count,word DESC) as running_sum FROM [publicdata:samples.shakespeare]

I'm facing 2 problems:

  1. I'm unable to let the sum start with the most common word (word with highest word_count). Setting DESC or ASC just doesn't change anything, and the sum starts with the least common word(s). If I change the order by to include only "order by word_count" than the running sum isn't correct since rows with the same order (== same word_count) yield the same running sum.

  2. In a similar query I'm executing (see below), the first row of the running sum yields a sum of 0, although the field I sum upon isn't 0 for the first row. Why does this happen? How can I workaround the problem to show the correct running sum? The query is:

select * from
(SELECT
mongo_id,
account_id,
event_date,
trx_amount_sum_per_day,
SUM (trx_amount_sum_per_day) OVER (PARTITION BY mongo_id,account_id ORDER BY event_date DESC) AS running_sum,
ROW_NUMBER() OVER (PARTITION BY mongo_id,account_id ORDER BY event_date DESC) AS row_num
FROM [xs-polar-gasket-4:publicdataset.publictable]
) order by event_date desc

Community
  • 1
  • 1
Lior
  • 1,357
  • 3
  • 21
  • 29
  • To expedite question 2: Can you share a public dataset with a sample of the data? – Felipe Hoffa Nov 28 '13 at 00:57
  • I shared the data. See revised question 2 for the problematic query – Lior Nov 30 '13 at 17:56
  • thanks for sharing - and indeed, there is something that needs further investigation for question 2 (reporting it internally) – Felipe Hoffa Dec 02 '13 at 18:26
  • Can you please update, when relevant, on the status of this issue? – Lior Dec 04 '13 at 12:05
  • There's actually an issue with OVER, PARTITION, and floats - it's being fixed (thanks for the report!). While the fix gets to production a workaround is casting to integers, as in: SELECT SUM(INTEGER(ROUND(x))) OVER(PARTITION BY y ORDER BY o) – Felipe Hoffa Dec 05 '13 at 02:11
  • Has this issue been fixed in production? Is there a way to track this bug? – Lior Jan 09 '14 at 11:59

1 Answers1

26

For question 1:

Change:

SELECT
  corpus, corpus_date, word_count, SUM(word_count)
OVER
  (PARTITION BY corpus, corpus_date
  ORDER BY word_count, word DESC) AS running_sum
FROM [publicdata:samples.shakespeare]

To:

SELECT
  corpus, corpus_date, word_count, SUM(word_count)
OVER
  (PARTITION BY corpus, corpus_date
  ORDER BY word_count DESC, word) AS running_sum
FROM [publicdata:samples.shakespeare]

(Original query is sorting by word, but you wanted to sort by word_count)

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325