3

For streaming inserts, I want to use a template table (with user id suffix) which is itself a Partitioned table. This way I can make my tables smaller than just using Partitioned Tables and hence make my queries more cost-effective. Also my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:-

To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.

It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?

One more concern regarding my above proposed architecture is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case or will the last partition keep storing any subsequent streamed data?

Nikhil Agarwal
  • 529
  • 1
  • 4
  • 16
  • See the following question to see how to get around this limit by partitioning by week/month/year: https://stackoverflow.com/a/56125049/132438 – Felipe Hoffa May 14 '19 at 07:29

1 Answers1

3

You should look into Clustered Tables on partitioned tables.

With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.

Introduction to Clustered Tables

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.

Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.

Clustered table pricing

When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.

Pentium10
  • 204,586
  • 122
  • 423
  • 502
  • This is amazing but sadly it's in alpha. I have applied for using this feature but since it's not production ready, I will use simple time-partitioned tables for now and later figure out a way to copy over the data to clustered table. – Nikhil Agarwal Jul 25 '18 at 05:47
  • 1
    Don't be afraid that's in alpha. It works great I've been using for months. – Pentium10 Jul 25 '18 at 06:19
  • 1
    This is going to be beta in the next few days. This is our recommended approach and we have a number of future improvements planned for it. – Pavan Edara Jul 26 '18 at 21:01