Using the same column in table for partitioning and clustering in BigQuery

Question

I'm moving to tables partitioned by a timestamp-column with its value in milliseconds. Now I want to generate clusters by hour, which will depend on the same timestamp-column I used for partitioning.

I want to use the same column for partitioning and clustering but I'm not sure if that works to generate hourly clusters.

I was planning on adding a new column which contains only the hour related with every record, and then using this column to create my clustered table, but I want to better understand what will happen if I use the same timestamp-column that I used for partitioning.

Are you frequently querying or filtering on hour? Just curious to see if you will get much benefit from this. Either way, the documentation (https://cloud.google.com/bigquery/docs/clustered-tables) says `When you use clustering and partitioning together, the data can be partitioned by a date or timestamp column and then clustered on a different set of columns.` Seems like you probably need to use a new column. — rtenha, Nov 07 '19 at 21:10
Please take a look at https://stackoverflow.com/questions/58743050/whats-a-good-balance-to-decide-when-to-partition-a-table-in-bigquery. Now, can you give us estimates of how much data you have per hour? And what kind of queries you want to run on it? — Felipe Hoffa, Nov 07 '19 at 22:41
Re: rtenha. BigQuery allows partitioning and clustering on the same column now. Thanks for pointing out the documentation. It needs to be updated. — Hua Zhang, Nov 07 '19 at 23:56
@HuaZhang can you possibly add a few lines on how partitioning and clustering on single column can improve performance? as i understand, data is ordered within a partition based on the values of the clustering column. why cluster on the same column for which there's only 1 value in the partition? Is it in cases where we have more than just 1 clustering column? — gravetii, Jun 27 '22 at 10:54
For column-based partitioning, the column in a same partition normally has more than 1 value. For example, a table is daily partitioned on `ts` which is a timestamp column. `ts` in a partition can have different timestamps of the day. If there is enough data in the partition, the data might be divided into multiple clusters based on `ts`. A query with `ts = some_timestamp` would only scan one cluster in the partition. — Hua Zhang, Jun 27 '22 at 19:09

Using the same column in table for partitioning and clustering in BigQuery

0 Answers0