Suppose I have a table with a column service_day of type DATE,
and the table is partitioned on this column:
CREATE TABLE `service`
PARTITION BY service_day
AS
SELECT cus_id, fees, service_day
FROM <another_table>
Five years later, I want to collapse the partitions to MONTH level for legacy data, while keeping recent data at DATE level. The following solution is suggested in the book "Google BigQuery: The Definitive Guide":
IF (service_day < '2018-01-01',
DATE_TRUNC(service_day, MONTH),
DATE_TRUNC(service_day, DAY)
) AS partition_col
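For context, here is a minimal sketch of how I assume this expression would fit into the table definition. The book only gives the expression itself, so partitioning the table on partition_col and keeping service_day as an ordinary column are my own reading, not something the author spells out. The column names and the <another_table> placeholder are the ones from my table above.

```sql
-- Sketch only: assumes the table is partitioned on the derived column.
CREATE TABLE `service`
PARTITION BY partition_col
AS
SELECT
  cus_id,
  fees,
  service_day,
  -- Rows before the cutoff share one partition per month;
  -- newer rows keep one partition per day.
  IF(service_day < DATE '2018-01-01',
     DATE_TRUNC(service_day, MONTH),
     service_day) AS partition_col
FROM <another_table>
```

Under this reading, service_day no longer drives partition pruning at all; only filters on partition_col do.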
The table would look like the following, with the last column showing the new partition key that each row is assigned to:
cus_id | fees | service_day | partition_col |
---|---|---|---|
... | 30 | 2017-03-02 | 20170301 |
... | 100 | 2017-03-15 | 20170301 |
... | 50 | 2017-03-20 | 20170301 |
... | 100 | 2022-05-03 | 20220503 |
... | 100 | 2022-05-16 | 20220516 |
... | 100 | 2022-06-01 | 20220601 |
... | 100 | 2022-06-03 | 20220603 |
The book mentions this solution only in a very short paragraph, without detailing its impact on the query side. My question is whether this approach inevitably makes user queries more complicated. To benefit from partition pruning (less data scanned), a query now has to include the additional partition_col in its WHERE clause,
and the user has to know at what level each date range is partitioned:
SELECT * FROM `service`
WHERE partition_col = DATE('2017-03-01')
AND service_day = '2017-03-15'
This doesn't look intuitive. Is there a better approach to collapsing partitions?