4

I'm moving to tables partitioned by a timestamp-column with its value in milliseconds. Now I want to generate clusters by hour, which will depend on the same timestamp-column I used for partitioning.

I want to use the same column for partitioning and clustering but I'm not sure if that works to generate hourly clusters.

I was planning on adding a new column which contains only the hour related with every record, and then using this column to create my clustered table, but I want to better understand what will happen if I use the same timestamp-column that I used for partitioning.

ggorlen
  • 44,755
  • 7
  • 76
  • 106
  • Are you frequently querying or filtering on hour? Just curious to see if you will get much benefit from this. Either way, the documentation (https://cloud.google.com/bigquery/docs/clustered-tables) says `When you use clustering and partitioning together, the data can be partitioned by a date or timestamp column and then clustered on a different set of columns.` Seems like you probably need to use a new column. – rtenha Nov 07 '19 at 21:10
  • 1
    Please take a look at https://stackoverflow.com/questions/58743050/whats-a-good-balance-to-decide-when-to-partition-a-table-in-bigquery. Now, can you give us estimates of how much data you have per hour? And what kind of queries you want to run on it? – Felipe Hoffa Nov 07 '19 at 22:41
  • 5
    Re: rtenha. BigQuery allows partitioning and clustering on the same column now. Thanks for pointing out the documentation. It needs to be updated. – Hua Zhang Nov 07 '19 at 23:56
  • @HuaZhang can you possibly add a few lines on how partitioning and clustering on single column can improve performance? as i understand, data is ordered within a partition based on the values of the clustering column. why cluster on the same column for which there's only 1 value in the partition? Is it in cases where we have more than just 1 clustering column? – gravetii Jun 27 '22 at 10:54
  • For column-based partitioning, the column in a same partition normally has more than 1 value. For example, a table is daily partitioned on `ts` which is a timestamp column. `ts` in a partition can have different timestamps of the day. If there is enough data in the partition, the data might be divided into multiple clusters based on `ts`. A query with `ts = some_timestamp` would only scan one cluster in the partition. – Hua Zhang Jun 27 '22 at 19:09

0 Answers0