
How do I compute the average duration of records whose end_date falls within 1 hour before this record's start_date?

I can do this with a self-join:

  SELECT AVG(p.duration) AS prior_duration
  FROM `bigquery-public-data`.london_bicycles.cycle_hire c
  JOIN `bigquery-public-data`.london_bicycles.cycle_hire p
  ON c.start_station_id = p.start_station_id AND
     p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND)
                  AND c.start_date

but how can I do it more efficiently (without a self-join)? Something along the lines of:

AVG(duration)
         OVER(PARTITION BY start_station_id
         ORDER BY UNIX_SECONDS(end_date) ASC 
         RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW) AS prior_duration

but which uses the start_date of the current record (rather than its end_date) to anchor the window.

Lak
  • it is not clear what is your desired result! is it average across the trips by start_station (so the output will have as many rows as number of stations) or the average value for each and every trip based on `1-hour before this record's start` rule (so the output will have as many rows as original table)? – Mikhail Berlyant May 15 '19 at 20:47
  • @Lak I THINK I understand the question, but just for clarity, based on your most recent comment the `PARTITION BY start_station_id` clause in the 'something along the lines of...' code block above isn't needed, right? – Stewart_R May 16 '19 at 19:20
  • The partition by is needed. I should edit my explanation: output will have as many rows as original table. The average value for each and every record based on trips that were *started at this station* and ended in the 1 hour before this record's start time. – Lak May 17 '19 at 08:55

2 Answers


UPDATE: See Mikhail's comment. This doesn't work. I've updated the query to avoid BigQuery doing a quick optimization.

Here's an exact solution. The idea is to build an array of all the records at a station and filter over the past hour using a correlated subquery. It took 7 seconds to process the entire dataset.

The array of records for a station has to be < 100 MB, though. Group by as many fields as needed to make your arrays small enough :)

WITH all_hires AS (
  -- Build one array of trip records per start station.
  SELECT 
    start_station_id
    , ARRAY_AGG(STRUCT(duration, 
                       start_date, 
                       TIMESTAMP_SUB(start_date, INTERVAL 1 HOUR) AS start_date_m1h, 
                       end_date)) AS hires
  FROM `bigquery-public-data`.london_bicycles.cycle_hire
  GROUP BY start_station_id
),

hires_by_ts AS (
  -- For each trip, correlated subqueries over the station's array average and
  -- count the trips that ended in the hour before this trip's start.
  SELECT
    start_station_id
    , h.start_date
    , (SELECT AVG(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS duration_prev_hour
    , (SELECT COUNT(duration) FROM UNNEST(hires) 
       WHERE end_date BETWEEN h.start_date_m1h AND h.start_date)
         AS numreturns_prev_hour
  FROM
    all_hires, UNNEST(hires) AS h
)

SELECT * FROM hires_by_ts
WHERE duration_prev_hour IS NOT NULL
ORDER BY duration_prev_hour DESC
LIMIT 5
Lak
  • Neat solution. This could also be another good use case for User Defined Aggregation functions... I'm sure you have the Google influence to push this up to the top of the backlog! ;-) https://issuetracker.google.com/issues/62898282 https://stackoverflow.com/q/43974448/2455494 – Stewart_R May 17 '19 at 12:34
  • you guys got confused yourselves and the rest (or maybe some) of us by using `LIMIT 5` - Obviously it took 7 seconds - see the execution plan to understand why! – Mikhail Berlyant May 17 '19 at 20:11
  • Thanks, Mikhail. Yup, this doesn't work. If I make it "ORDER BY duration_prev_hour LIMIT 5", it goes on forever since there is no fast way to get just 5 rows in result. – Lak May 18 '19 at 21:09

Given that you can't use different fields for the ordering and the window frame boundary, the only way I can think of doing this is in two steps, with the caveat that you might (and likely will) miss some rows:

WITH cycle_hires AS (
  -- For each trip, collect up to the 100 preceding rows (ordered by end_date)
  -- at the same start station, plus the current row.
  SELECT 
    start_station_id,
    start_date,
    ARRAY_AGG(STRUCT(end_date, duration)) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS previous
  FROM `bigquery-public-data`.london_bicycles.cycle_hire AS c
)
SELECT
  c.start_station_id,
  AVG(p.duration) AS previous_duration,
  COUNT(*) AS number_of_previous_trips_used
FROM cycle_hires AS c
  JOIN UNNEST(previous) AS p
-- Keep only returns that actually fall within the hour before this trip's start.
WHERE p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND) AND c.start_date
GROUP BY 1

With this dataset (~24M rows), using up to 100 preceding rows takes ~20s; increasing that to 1,000 preceding rows takes ~120s.

  • unfortunately - this is total fake and has nothing to do with quality of result but obviously fast as it randomly reduces size of join! key word here is "randomly" - even though query might look cool because you can control "randomness" by setting 100 vs 1000 - the author of this answer got trapped into the very same issue as question is asking to resolve – Mikhail Berlyant May 21 '19 at 00:10
  • The goal here was to reduce the size of the join, acknowledging that RANGE isn't an option given the different fields. The ORDER BY clause ensures that it is not random, but it will miss rows when there are more than X preceding rows (using UNBOUNDED PRECEDING would include all rows but results in poor performance). The second query uses the output from the CTE and makes sure only records within the 3600s time frame are used. – Robert Saxby May 22 '19 at 08:03
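
For reference, a sketch of the exact-but-slower variant the last comment describes: the same query with the fixed 100-row frame replaced by UNBOUNDED PRECEDING, so no prior returns are missed, at the cost of much larger per-row arrays and poorer performance:

WITH cycle_hires AS (
  SELECT
    start_station_id,
    start_date,
    -- Every prior return (by end_date) at this station, not just the last 100.
    ARRAY_AGG(STRUCT(end_date, duration)) OVER (
      PARTITION BY start_station_id
      ORDER BY end_date ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS previous
  FROM `bigquery-public-data`.london_bicycles.cycle_hire
)
SELECT
  c.start_station_id,
  AVG(p.duration) AS previous_duration,
  COUNT(*) AS number_of_previous_trips_used
FROM cycle_hires AS c
  JOIN UNNEST(previous) AS p
WHERE p.end_date BETWEEN TIMESTAMP_SUB(c.start_date, INTERVAL 3600 SECOND) AND c.start_date
GROUP BY 1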