1

Have just created a stream from PostreSQL to BigQuery using Datastream and was pretty pleased with the results. For each table I altered the DDL after initial streaming to add daily partitions on our created_at fields assuming everything would work as it has done on our previous version of this dataset. However, the partitions are not referenced or honored when querying the data so am left with inefficient and expensive queries scanning and billing the whole table each time.

Quick example with one of our tables for the same simple query:

-- Original table NOT using Datastream, let's call the dataset "original"
SELECT max(created_at), count(1) FROM original.orders WHERE date(created_at) = '2023-05-01';

-- Query processes 1.62MB


-- New table using Datastream, let's call the dataset "datastream"
SELECT max(created_at), count(1) FROM datastream.orders WHERE date(created_at) = '2023-05-01';

-- Query processes 3.18GB

When looking into the execution details there is perhaps some insight into what is happening. The query on the datastream version of the table actually references two separate tables and then combines them. The tables have prefixes on them (1) datastream.DELTA_CDC_TABLE_orders and (2) datastream.CDC_TABLE_orders It makes sense that the DELTA table might not be partitioned yet and hence have to scan everything not yet added to the main table, but don't understand why the CDC_TABLE_orders can't utilize partitions.

Anyone have ideas how to write the query so that it has to use the partitions and not scan the full table? Couldn't find anything in the documentation around this

Have tried looking into querying for a table suffix, changing the format of the partition ('YYYYMMDD', 'YYYY-MM-DD HH:MM:SS'), searching for the CDC_TABLE_ version of the table

  • what column(s) did you add for the daily partitions? you just need to reference them in the WHERE clause to take advantage rather than applying a function on created_at which will have to scan all the rows – Kurt May 08 '23 at 20:39
  • The partition is on the created_at field and this is being applied in the WHERE clause of the queries already. The date function around created_at doesn't have any impact on partitioning, it should still prune the table – Matt Millen May 08 '23 at 22:29
  • I'm no expert but my hunch is the date function around created_at is what is causing the engine to ignore partitions. You're not comparing any value to the partition column, but to a function return value, so it has to call the function on every row and then apply the filter – Kurt May 09 '23 at 14:14
  • try something like `where created_at >= '2023-05-01' and created_at < '2023-05-02'` where you directly compare the partition column in the filter – Kurt May 09 '23 at 14:16
  • or change your partition to a column that contains only the truncated date part and compare directly to that – Kurt May 09 '23 at 14:21
  • Partitions work fine with date functions applied, the partition is actually on date(created_at). This seems to be an issue specific to Datastream and the prefixed tables it queries against – Matt Millen May 09 '23 at 14:28
  • Max_staleness table option seems to be the solution here – Matt Millen May 11 '23 at 22:47

0 Answers0