Have just created a stream from PostgreSQL to BigQuery using Datastream and was pretty pleased with the results. For each table I altered the DDL after the initial streaming to add daily partitions on our created_at fields, assuming everything would work as it did on our previous version of this dataset. However, the partitions are not referenced or honored when querying the data, so I'm left with inefficient and expensive queries that scan and bill the whole table each time.
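To illustrate what I mean by daily partitions on created_at, the partition definition was roughly as below (a simplified sketch with a placeholder table name; in practice this means recreating the table, since BigQuery can't add partitioning to an existing table in place, and the Datastream/CDC settings would also need to be re-applied):

-- Simplified sketch of the daily partitioning added on created_at
CREATE TABLE datastream.orders_partitioned
PARTITION BY DATE(created_at)
AS
SELECT * FROM datastream.orders;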
A quick example with one of our tables, running the same simple query:
-- Original table NOT using Datastream, let's call the dataset "original"
SELECT max(created_at), count(1) FROM original.orders WHERE date(created_at) = '2023-05-01';
-- Query processes 1.62MB
-- New table using Datastream, let's call the dataset "datastream"
SELECT max(created_at), count(1) FROM datastream.orders WHERE date(created_at) = '2023-05-01';
-- Query processes 3.18GB
Looking into the execution details gives some insight into what is happening. The query on the datastream version of the table actually references two separate tables and then combines them. The tables have prefixed names:
(1) datastream.DELTA_CDC_TABLE_orders
(2) datastream.CDC_TABLE_orders
It makes sense that the DELTA table might not be partitioned yet, so everything not yet merged into the main table has to be scanned, but I don't understand why CDC_TABLE_orders can't make use of the partitions.
Does anyone have ideas on how to write the query so that it uses the partitions rather than scanning the full table? I couldn't find anything in the documentation about this.
I've tried querying with a table suffix, changing the format of the partition filter ('YYYYMMDD', 'YYYY-MM-DD HH:MM:SS'), and searching for the CDC_TABLE_ version of the table, roughly as sketched below.
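For illustration, a couple of the attempts looked roughly like this (simplified reconstructions rather than the exact statements):

-- Alternative partition filter format on the partition column
-- (assuming created_at is a TIMESTAMP):
SELECT max(created_at), count(1)
FROM datastream.orders
WHERE created_at >= TIMESTAMP('2023-05-01')
  AND created_at < TIMESTAMP('2023-05-02');

-- Searching the dataset metadata for the CDC_TABLE_ version seen in the execution details:
SELECT table_name
FROM datastream.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE '%CDC_TABLE_orders%';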