11

I know there is an AVG function in Big Query and there are window functions to shift previous or next values up or down a row, but is there any function that allows you to average over a specified interval? For instance I'd like to so something like the following:

SELECT
    city
    AVG(temperature) OVER(PARTITION BY city, INTERVAL day,14, ORDER BY day) as rolling_avg_14_days,
    AVG(temperature) OVER(PARTITION BY city, INTERVAL day,30, ORDER BY day) as rolling_avg_30_days,
WHERE
    city IN ("Los Angeles","Chicago","Sun Prairie","Sunnyvale")
    AND year BETWEEN 1900 AND 2013

I'd like to do a rolling average calculation that allows me to specify a range of values to do an aggregation function over, and what value to order by. The average function would take the current day temp and previous 13 days (or previous 29) to calculate and average. Is this possible today? I know I could do something like this if I put 13 LAG/OVER fields in the SELECT statement and then avg the results of all of them, but that is a lot of overhead.

Steve Scherer
  • 315
  • 1
  • 5
  • 16
  • I did find a [previous post](http://stackoverflow.com/questions/12763404/bigquery-moving-average) from Jordan about using a JOIN to produce a rolling average. I tried using that logic and it works great, however I'm wondering if there is a window function in development that would make the logic a bit simpler. – Steve Scherer Mar 07 '14 at 16:44

2 Answers2

22

I think OVER with RANGE construction of Window Functions is best fit here

Assuming that day field is represented as 'YYYY-MM-DD' format, below query does the rolling averages

SELECT
  city,
  day,
  AVG(temperature) OVER(PARTITION BY city ORDER BY ts 
                RANGE BETWEEN 14*24*3600 PRECEDING AND CURRENT ROW) AS rolling_avg_14_days,
  AVG(temperature) OVER(PARTITION BY city ORDER BY ts 
                RANGE BETWEEN 30*24*3600 PRECEDING AND CURRENT ROW) AS rolling_avg_30_days
FROM (
  SELECT day, city, temperature, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts 
  FROM temperatures
)

You most likely already found this solution long ago, but still wanted to have what i think a better answer (as of today) here for this question

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • This is now legacy syntax – zthomas.nc Feb 01 '21 at 20:02
  • First - this is 2014 question and 2016 answer - so what the point of down voting!?Secondly - which part is legacy!? I think you are mistaken! - If you mean `TIMESTAMP_TO_SEC` so it is not a focus or either question or answer! The focus was `OVER with RANGE`. Anyway ... – Mikhail Berlyant Feb 01 '21 at 20:25
  • Hi Mikhail, thank you for this answer, I am learning from it now to adapt this to my current challenge. Can you explain the 14*24*3600 portion? I understand that 14 and 30 are to account for 14 days or 30 days. What do 24 and 3600 represent? – Arsik36 Mar 04 '21 at 16:20
  • 24: 24 hours per day and 3600: 3600 seconds per hour – Mikhail Berlyant Mar 04 '21 at 16:42
  • The linked documents in the answer are now legacy. Here is a link to the update: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts Answer is still relevant – hitwill May 20 '21 at 21:46
0

A different option with JOIN EACH (this can get too slow as an extremely large amount of data can get generated in the intermediate steps):

SELECT a.SensorId SensorId, a.Timestamp, AVG(b.Data) AS avg_prev_hour_load
FROM (
  SELECT * FROM [io_sensor_data.moscone_io13]
  WHERE SensorId = 'XBee_40670EB0/mic') a
JOIN EACH [io_sensor_data.moscone_io13] b
ON a.SensorId = b.SensorId
WHERE b.Timestamp BETWEEN (a.Timestamp - 36000000) AND a.Timestamp
GROUP BY SensorId, a.Timestamp;

(based on Joe Celko's SQL problems)

For window functions, it could be useful to have one implementing a bigger range, but for now I would automate the generation of the query.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325