
I was hoping to use your experience with BigQuery to decide on the best structure for my data. I have 5 years of experience with relational databases, especially Oracle DB, and I am familiar with the best practices for building a data warehouse on a relational database, but when it comes to cloud solutions I'm still new.

My question is about partitioned tables in BigQuery. As far as I know, we only have partitioning by day in BigQuery.

A lot of my tables have a low row count per day (between 1K and 12K max), but they contain data for a long period of time. So is it a good idea to partition such a table by day? If I split my data into one table per year and created a view to represent all the tables, would that help? Or is there another solution in BigQuery? In your experience, when is partitioning in BQ efficient, and how large should the daily data be?

If you have any reference or book about best practices for using BQ for a DWH, please point me to it.

Joe

2 Answers


A lot of my tables have a low row count per day (between 1K and 12K max), but they contain data for a long period of time. So is it a good idea to partition such a table by day?

That would depend on how wide a timeframe is covered by the queries you are going to run.

Case A.
For example, if the maximum timeframe covered by your queries is 1 week, then daily partitioning can be very beneficial in terms of speed and cost, because the engine doesn't need to scan the whole table; it selects only the 7 relevant daily partitions.
Example: for the public dataset bigquery-public-data.samples.github_timeline, a query that covers 1 day incurs data usage of ~hundreds of megabytes due to the full table scan. Working with a similar but daily-partitioned table reduces the data usage incurred by the 1-day query to ~10 MB.
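
As a minimal sketch (the table and column names here are hypothetical, not from the question), a query that filters on the partitioning column lets the engine prune everything except the matching daily partitions:

  -- Hypothetical table partitioned by DATE(created_at).
  -- Only the 7 daily partitions matching the filter are scanned.
  SELECT repo_name, COUNT(*) AS events
  FROM `mydataset.daily_events`
  WHERE DATE(created_at) BETWEEN '2019-12-01' AND '2019-12-07'
  GROUP BY repo_name;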

Case B.
On the other hand, if the most frequently run queries cover a timeframe that is one or two years wide (e.g., timewise, the whole span of data in your table), then daily partitions don't help much: the engine needs to process nearly the whole table anyway. Clustering can even have a detrimental effect on data usage if the amount of daily data is less than the minimal cluster size, because the engine will end up processing more disk space.

It would be helpful if similar considerations were documented, along with what the actual minimal cluster size is and what the minimal partition size is (if any). If these figures are dynamic, then what are the expected ranges?

Useful answers:
1
2
3

winwiz1
  • Actually most of the filters are for a period between 3 months and 2 years, and some tables contain only hundreds of rows for every single day. That's why I'm not sure whether to use partitioning or not, because usually bad partitioning makes things worse rather than helping. – Joe Dec 17 '19 at 05:28

You have two partition modes in BigQuery:

  1. partition by date
  2. partition by integer range

Both can hold up to 4,000 partitions, so choose your option wisely.

If your date range is larger than 4,000 days, or is expected to become larger, you may want to partition by week/month instead.
For an integer range, you can use Unix time as the integer and make each range wider than 24 hours, so the partitions span more than 10 years. It's just math.
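
A minimal sketch of the integer-range approach (table and column names are hypothetical): each bucket below covers one week of Unix time starting at 2015-01-01, so roughly 19 years of data fit comfortably under the 4,000-partition limit.

  CREATE TABLE `mydataset.events_by_unixtime` (
    event_ts_unix INT64,  -- event time as Unix seconds
    event STRING,
    country STRING
  )
  PARTITION BY RANGE_BUCKET(
    event_ts_unix,
    -- 604800 seconds = 1 week; 1420070400 = 2015-01-01 UTC; ~1000 buckets
    GENERATE_ARRAY(1420070400, 2024784000, 604800)
  );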

Choose a consistent naming convention for the partitioning column of your tables, e.g. partitioned or timestamp.

P.S. You can additionally cluster your table on up to 4 columns.

Some practical examples:

  • partitioned by timestamp, clustered by country
  • partitioned by a weekly timestamp, clustered by a day column
  • partitioned by an integer-range column, e.g. a Unix time, clustered by event, category, language
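
For instance, the first option above could look roughly like this (all names are illustrative, not from the question):

  CREATE TABLE `mydataset.events_partitioned` (
    event_ts TIMESTAMP,
    country STRING,
    event STRING
  )
  PARTITION BY DATE(event_ts)  -- one partition per day
  CLUSTER BY country           -- up to 4 clustering columns are allowed
  OPTIONS (
    require_partition_filter = TRUE  -- optional: force queries to prune partitions
  );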
Pentium10
  • So is it possible to partition by a weekly or monthly timestamp? Or will I have to create a new column with the week value in it? – Joe Dec 17 '19 at 05:23
  • @Joe You need to create a new column for it (see the sketch after these comments). – Pentium10 Dec 17 '19 at 06:53
  • Will this help? Because if I create a report and I want to give the customer the ability to select any range of dates, not only weeks, then the filter will be on the day-level date field. – Joe Dec 17 '19 at 08:37
  • You only need to use weekly dates if you don't fit within the 4,000-partition limit. – Pentium10 Dec 17 '19 at 09:35
  • Actually most of the filters are for a period between 6 months and 2 years, and some tables contain only hundreds to a few thousand rows for every single day. That's why I'm not sure whether to use daily partitioning or not (whether it would save some costs or enhance performance), because usually bad partitioning makes things worse rather than helping. – Joe Dec 18 '19 at 09:02
  • @Joe If you say that bad partitioning makes things worse, you are not aware of some things. Partitioning is used mostly for reducing costs; it is used to scan less data. Partitioning in BigQuery is not similar to a relational database, so its implementation doesn't really have a "worse" case. Well, there is one: loading data into a partition is slightly slower. But only at load time; querying is fast. – Pentium10 Dec 18 '19 at 09:43
  • I'm new to BigQuery (and BQ is not as well documented as common RDBMSs). In a relational DB, partitioning gives better performance, easier management, and can even help loading operations, but only if we choose the partitioning technique wisely; otherwise it might complicate things and sometimes hurt the performance of the whole DB. For BigQuery I'm not sure if there are any disadvantages or bad practices in choosing partitions, but as I understand from you, the only impact is on loading time and it won't affect anything else, I just have to be aware of the limit on the number of partitions, correct? – Joe Dec 20 '19 at 04:37
  • The impact is only at loading, plus the limit of a maximum of 4,000 partitions. You get a ton of benefits after turning on partitioning. – Pentium10 Dec 20 '19 at 08:35
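
To illustrate the "new column" approach discussed in the comments above (a sketch with made-up names): truncate the event date to the week it falls in, partition on that derived column, and keep the original day-level date available for report filters.

  CREATE TABLE `mydataset.events_weekly` (
    event_date DATE,  -- original day-level date, still usable in WHERE clauses
    event_week DATE,  -- derived: DATE_TRUNC(event_date, WEEK)
    event STRING
  )
  PARTITION BY event_week
  CLUSTER BY event_date;

  -- Populate the derived column when loading:
  INSERT INTO `mydataset.events_weekly` (event_date, event_week, event)
  SELECT event_date, DATE_TRUNC(event_date, WEEK), event
  FROM `mydataset.staging_events`;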