0

We're in the midst of migrating our on-prem Oracle db to the cloud. The biggest project is to move our fact table that tracks customer transactions.

Short Question: What is best way to shard/partition a fact table in BigQuery when you can't use a date field for partitioning because of 4,000 partition limit? Objective is to maximize query performance and minimize costs.


Detailed Question I do not want to duplicate the table in BigQuery, because I want it to be optimized for BigQuery. So I've been looking into partitioning, sharding and clustering. Also looking into denormalization, but that's a different question.

In our Oracle db, we simply partition by an integer date YYYYMMDD. I don't believe we can do this in BigQuery, however, due to the fact a table can only have 4,000 partitions. If we partition by day, our table can only contain a little less than 11 (4000/365) years worth of data -- which is well below what we currently need to migrate.

There are certainly other fields we could partition besides date (eg site location), but I believe date might be best.

Below are the options I'm considering. Let's say the table contains a datetime column order_date and the integer date version order_date_id

  1. Shard by year (ie all orders with order_date in 2001 go into my_table_2001, partition each table by order_date
  2. No sharding, have one big table and create an integer column for year (order_year) and use it for partition column
  3. Shard by another column (eg site location), then partition by order_year
  4. Shard by order_year and another column (eg site location), partition by order_date

If I'm going to shard tables, I definitely want to use a datetime column for partitioning so I can use wildcards to query all sharded tables. I found out using integer ranges for partitioning prevents you from using wildcards.

Also important to note that business users may want to consistently query data for large date ranges if not all available data.

Korean_Of_the_Mountain
  • 1,428
  • 3
  • 16
  • 40
  • 1
    That's an excellent question - did you consider this already? https://stackoverflow.com/questions/58743050/whats-a-good-balance-to-decide-when-to-partition-a-table-in-bigquery – Felipe Hoffa Mar 24 '20 at 18:34
  • I'm with Felipe Hoffa to suggest looking at clustering, and, if you have full control of query, use a combination of partitioning by month and clustering by date. – Yun Zhang Mar 25 '20 at 16:10

1 Answers1

0

You can consider to use the integer range partition table: https://cloud.google.com/bigquery/docs/creating-integer-range-partitions

This is a recent GA released function. You can split your integer date (YYYYMMDD) in two columns: YYYYMM and DD and make partition on YYYYMM in this way you can have 4000/12 = 333 partitions. If you need to query a single day, you can set a where clause in you query. To save the query cost, a best practice, is select only the columns of interest and set where clause (when possible).

Claudio
  • 642
  • 3
  • 12
  • I think the main issue with using integer range for partition field is the fact that I won't be able to query tables using wildcards. – Korean_Of_the_Mountain Mar 25 '20 at 15:21
  • Could you please write a example of query? I need to understand better what you mean – Claudio Mar 25 '20 at 15:32
  • When I tried I sharding tables by year and partition by integer date column, I couldn't do a query like this to query all tables from 2000 to 2009:``select * from `my_table_200* ``. – Korean_Of_the_Mountain Mar 25 '20 at 15:35
  • But if you're saying to partition by month as a datetime type then I won't have to shard because I have 333 years worth of partitions – Korean_Of_the_Mountain Mar 25 '20 at 17:06
  • One possible addition to Claudio's approach is to also cluster the table on the order_date column. When you query the table you use order_date_id to filter to a month, and order_date to filter to only data related to the dates. You can also use order_date_month, which is order_date truncated to first day of month, if you still want to time partitioning. – Hua Zhang Mar 25 '20 at 18:56