0

I want to understand if I have a table that has a column service_date which is a DATE type, and the partition uses this column as partition key.

CREATE TABLE `service` 
PARTITION BY service_day
AS

SELECT cus_id, fees, service_day
FROM <another_table>

But 5 years later, I want to collapse the partitions into MONTH level for legacy data, while keeping the recent in DATE level. The following solution is suggested by the author of this book "Google BigQuery: The Definitive Guide":

IF (service_date < '2018-01-01', 
    DATE_TRUNC( service_date, MONTH),
    DATE_TRUNC( service_date, DAY)
) AS partition_col

The table should look like the following, and the last column is showing the the new partition key that each row is partitioned to.

cus_ id fees service_day partition_col
... 30 2017-03-02 20170301
... 100 2017-03-15 20170301
... 50 2017-03-20 20170301
... 100 2022-05-03 20220503
... 100 2022-05-16 20220516
... 100 2022-06-01 20220601
... 100 2022-06-03 20220603

The author only mentioned this solution in a very short paragraph without detailing how this might have impact on the query side. My question is whether this approach would inevitably make the user query more complicated, because now if the user would like to have the benefits from reducing data scan, then the query has to include additional partition_col in WHERE and the user has to know at what level it is partitioned by:

SELECT * FROM `service` 
WHERE partition_col = DATE('2017-03-01')
AND   service_day = '2017-03-15'

This doesn't look intuitive -- wondering if there is better approach for partition collapse?

Katherine Chen
  • 502
  • 2
  • 5
  • 14
  • when you say "collapse the partitions into MONTH level" you want to to get data partition by month? – Sakshi Gatyan Sep 26 '22 at 06:52
  • @SakshiGatyan: yes the idea is to provide more coarse partition on legacy data (i.e. MONTH level for data that is older than a specific date) while remaining the DAILY level for data after that. And these different level of partitions are in the same table. – Katherine Chen Sep 26 '22 at 16:38
  • 1
    you can try writing your queries to operate at the level of months as mentioned in this [post](https://stackoverflow.com/questions/41585218/bigquery-table-partitioning-by-month) – Sakshi Gatyan Oct 11 '22 at 11:31
  • Thanks @sakshiGatyan. I came across that post and what David said in one of the comment was describing the shortcomings. He mentioned "The only caveat is that you need to use event_month in your queries, so if you're looking for something on 2019-01-15 you would need to specify and event_month='2019-01-01'. – Katherine Chen Oct 12 '22 at 13:21

0 Answers0