
I would highly appreciate any suggestions to optimise the query below, as it is currently timing out. BikeLogsTable contains about 10,000 to 1 million rows that can be joined on each row, and as it stands, this query times out after 6 hours.

Note that the core goal here is to associate each row of BikesTable with the correlating row in States. In this step I'm attempting to determine the associated row of BikeLogsTable and retrieve the row number to determine the last associated entry.

SELECT
    Bikes.vehicleId,
    Bikes.timestamp_field_24,
    States.to,
    States.vehiclestatechangeid,
    -- note: a comma before the window function and an alias were missing
    ROW_NUMBER() OVER (
        PARTITION BY Bikes.timestamp_field_24, Bikes.vehicleId
        ORDER BY States.timestamp
    ) AS rn

FROM `BikesTable` AS Bikes
RIGHT JOIN `BikeLogsTable` AS States
    ON Bikes.vehicleId = CAST(States.vehicleid AS STRING)

WHERE
    -- TIMESTAMP_SUB (not DATE_SUB) is required, since PARSE_TIMESTAMP returns a TIMESTAMP
    TIMESTAMP_SUB(PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")), INTERVAL 1 DAY) < States.timestamp
    AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) > States.timestamp
    AND States.to IS NOT NULL
    AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) > "2021-11-22"
    AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) < "2021-11-24"


As this is a one-time task, I'm happy for suggestions on wacky workarounds.

Aleko
    For one thing, you have a lot of nested regex. It seems like you could speed this up with better regex logic which matches on multiple patterns at once, and also maybe doing this once into a temporary table rather than computing the same column many times within the select. – Michael Delgado Dec 18 '21 at 19:15
  • This was unfortunately not the reduction needed; the operation still times out. However, I am now trying an approach with clustered tables, executing the join through a partitioned Python function. – Aleko Dec 19 '21 at 01:02

1 Answer


2 Suggestions

  1. Rewrite your query so your biggest table is on the left, in this case BikeLogsTable.

  2. Do as much casting and filtering as early as possible, including before joins.

To rewrite your query a bit...

with
Bikes as (
    select 
        *,
        -- handle your casting/parsing here instead of multiple times in your where clause 
        TIMESTAMP_SUB(PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")), INTERVAL 1 DAY) as ts1, -- TIMESTAMP_SUB, since PARSE_TIMESTAMP returns a TIMESTAMP
        PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) as ts2
    from `BikesTable`
),
States as (
    select 
        * except(vehicleid), 
        -- let's cast here instead of in the join clause
        cast(vehicleid as string) as vehicleId 
    from `BikeLogsTable` 
    where to is not null -- filter this out early too!
),
joined as (

    select
        Bikes.vehicleId, Bikes.timestamp_field_24, States.to, States.vehiclestatechangeid, States.timestamp as states_ts
    from States
    left join Bikes using(vehicleId) -- maybe inner join depending on your actual data
    where Bikes.ts1 < States.timestamp
      and Bikes.ts2 > States.timestamp
      and Bikes.ts2 > '2021-11-22'
      and Bikes.ts2 < '2021-11-24'
)
select 
    *,
    row_number() over(partition by timestamp_field_24, vehicleId order by states_ts) as rn -- do ordering last!
from joined
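Since the stated goal is the *last* associated entry per bike, the row numbering could also be consumed directly with a `QUALIFY` clause rather than materialising `rn` and filtering afterwards. A sketch, assuming BigQuery standard SQL and the same `joined` CTE as above:

```sql
-- Sketch only: keep the most recent BikeLogsTable entry per bike/timestamp pair.
-- Ordering descending means row 1 is the last entry, so no second pass is needed.
select *
from joined
qualify row_number() over (
    partition by timestamp_field_24, vehicleId
    order by states_ts desc  -- descending: row_number 1 = latest state change
) = 1
```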

This may require some additional editing to get the results you want, but this general idea should speed you up!
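Following up on the comment about the nested regexes: the four `REGEXP_REPLACE` calls can likely be collapsed into a single alternation. A sketch, assuming `timestamp_field_24` holds ISO-8601-style strings such as `2021-11-22T10:00:00.123+01:00`:

```sql
-- Sketch: one alternation replaces the four nested REGEXP_REPLACE calls.
-- "\+.*" and "\..*" drop the offset / fractional seconds; [TZ] becomes a space;
-- TRIM removes the trailing space before parsing.
PARSE_TIMESTAMP(
    "%F %T",
    TRIM(REGEXP_REPLACE(timestamp_field_24, r"\+.*|\..*|[TZ]", " "))
) as ts2
-- If the strings are well-formed ISO 8601, TIMESTAMP(timestamp_field_24)
-- may even parse them directly, with no regex at all.
```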

rtenha
  • 3,349
  • 1
  • 6
  • 19
  • 1
    This was a great suggestion, thank you. However I ended up going with a clustered table by date and BikeId – Aleko Dec 26 '21 at 00:12
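For reference, the clustered-table approach Aleko mentions could be set up along these lines; the dataset and destination table names here are hypothetical, and the source column names are taken from the question:

```sql
-- Hypothetical destination: `my_dataset.BikeLogsByDay` is an assumed name.
-- Partitioning by day plus clustering on vehicleid lets BigQuery prune
-- both the date range and the join key before the expensive join runs.
CREATE TABLE `my_dataset.BikeLogsByDay`
PARTITION BY DATE(timestamp)
CLUSTER BY vehicleid
AS
SELECT *
FROM `BikeLogsTable`
WHERE to IS NOT NULL;  -- filter early, as in the answer above
```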