2

I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't because there are too many distinct dates?

Does clustering help here? The natural cluster key for my data would still have some values with data spanning more than 4,000 days.

Ask Bjørn Hansen
  • How many rows per day do you have? One option would be to partition by week or month, and cluster by date or timestamp. – Elliott Brossard Feb 17 '19 at 09:42
  • I'm a fan of yearly tables, then you can query all of them - if needed - with a * – Felipe Hoffa Feb 17 '19 at 19:51
  • @ElliottBrossard I'm using the "partition on timestamp column" feature that can only use "day". After loading the data I'm only up to 3,661 days, so I have a year to see if the limit gets increased, I guess. I only have about 20-25 million rows per month, but part of this work is to be able to add ~10x the data. – Ask Bjørn Hansen Feb 17 '19 at 21:22
  • @FelipeHoffa The old (non-BigQuery) system did that and it wasn't nearly as convenient as having "built-in" partitions. I have the same data in ClickHouse and just having one table is great for non-automated queries. – Ask Bjørn Hansen Feb 17 '19 at 21:23
  • Some of my use cases are to fetch just a day or two of data for a particular set of IDs, so currently I partition on day (in one big table) and cluster on the other ID columns. At a glance it seems to work well, except for running out of partitions sometime in the next year. – Ask Bjørn Hansen Feb 17 '19 at 21:27
  • could the previous system refer to multiple tables with *? – Felipe Hoffa Feb 17 '19 at 22:42

4 Answers

1

Two solutions I see:

1) Combine table sharding (one table per year) with time partitioning based on your column. I never tested that myself, but it should work, since every shard is treated as a separate table in BigQuery. With that you can easily address the shard plus the partition with one wildcard/variable.
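
For illustration, a hedged sketch of such a wildcard query, assuming hypothetical yearly shards `mydataset.events_2011` … `mydataset.events_2021`, each partitioned on a `day` DATE column (all names here are illustrative, not from the question):

    -- Query all yearly shards at once via a wildcard table,
    -- then prune to one shard and one partition within it.
    SELECT *
    FROM `myproject.mydataset.events_*`
    WHERE _TABLE_SUFFIX = '2015'        -- picks the yearly shard
      AND day = DATE '2015-06-01';      -- prunes to a single daily partition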

2) A good workaround is to create an extra column derived from the date field you want to partition on. For every entry older than about 9 years (e.g. DATE_DIFF(CURRENT_DATE(), DATE('2009-01-01'), YEAR)), set that column to the 1st of the entry's month.

With that you are able to fit roughly another 29 years of data.

Be aware that you cannot filter on that column with a date filter, e.g. in Data Studio, but it works in queries.
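
A minimal sketch of computing that extra partition column, assuming a hypothetical source table with a `ts` TIMESTAMP column (table and column names are illustrative):

    -- Old rows collapse to the 1st of their month (one partition per month);
    -- recent rows keep their exact date (one partition per day).
    SELECT
      *,
      IF(DATE_DIFF(CURRENT_DATE(), DATE(ts), YEAR) > 9,
         DATE_TRUNC(DATE(ts), MONTH),
         DATE(ts)) AS part_date
    FROM `myproject.mydataset.raw_events`;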

Best Thomas

Thomas
1

Currently, per the documentation, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.

You can put old data into a single partition per year. You need to add an extra column to your table to partition on.

Say all data for year 2011 goes into partition 20110101. For newer data (2019) you can have a separate partition for each date.

This is not a clean solution to the problem, but with it you can optimize further by using clustering to keep table scans minimal.
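
A hedged sketch of that partition column, assuming a hypothetical `event_date` DATE column (names are illustrative):

    -- Everything before 2019 collapses to Jan 1 of its year
    -- (e.g. 2011 -> 2011-01-01); 2019 onward keeps one partition per day.
    SELECT
      *,
      IF(event_date < DATE '2019-01-01',
         DATE_TRUNC(event_date, YEAR),
         event_date) AS partition_date
    FROM `myproject.mydataset.raw_events`;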

1

4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround would be to use a view:

  1. Split your table into decades ensuring all tables are partitioned on the same field and have the same schema
  2. Union the tables together in a BigQuery view

This results in a view spanning 4,000+ partitions, which business users can query without worrying about which version of the table they need to use or unioning the tables themselves.
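
A minimal sketch of such a view, assuming two hypothetical decade tables with identical schemas and the same partitioning column (names are illustrative):

    -- Business users query the view; partition pruning still applies
    -- to each underlying partitioned table.
    CREATE OR REPLACE VIEW `myproject.mydataset.events_all` AS
    SELECT * FROM `myproject.mydataset.events_2000s`
    UNION ALL
    SELECT * FROM `myproject.mydataset.events_2010s`;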

Matt Laz
1

It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.

In that case, see:
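
As a rough illustration of month-level partitioning combined with clustering (see also Elliott's comment on the question), here is a hedged DDL sketch assuming a hypothetical table with a `ts` TIMESTAMP and a `node_id` column; monthly column partitioning may not have been available in BigQuery when this answer was written:

    -- One partition per month keeps the partition count low even for
    -- several decades of data; clustering narrows scans within a partition.
    CREATE TABLE `myproject.mydataset.events_monthly`
    PARTITION BY TIMESTAMP_TRUNC(ts, MONTH)
    CLUSTER BY node_id, ts
    AS
    SELECT * FROM `myproject.mydataset.raw_events`;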

Felipe Hoffa