1

I build statistical output generated on-demand from data stored in BigQuery tables. Some data is imported daily via stitch using "Append-Only". This results in duplicated observations in the imported tables (around 20kk rows growing 8kk yearly).

I could either schedule a BigQuery query to store deduplicated values in a cleaned table, or build views to do the same, but I don't understand the tradeoffs in terms of:

  • costs on BigQuery for storing/running scheduled queries and views.
  • speed of later queries dependent on deduplicated views. Do the views cache?

Am I correct to assume that daily scheduled queries to store deduplicated data is more costly (for re-writing stored tables) but speeds up later queries to the deduplicated data (saving on costs for usage) ?

The deduplicated data will namely in turn be queried hundreds of times daily to produce dashboard output for which responsiveness is a concern.

How should I argue when deciding for the better solution?

nJGL
  • 819
  • 5
  • 17
  • This is really an opinion question, but I would use scheduled queries or something like dbt. – rtenha Jul 22 '20 at 22:37

2 Answers2

2

Lets go to the facts:

  1. The price you will pay in the query is the same regardless you are using a View or a Scheduled Query
  2. When using a Scheduled Query, you will need to pay for the data you store in the de-duplicated table. As a View will not store any data, you will not have extra charges.
  3. In terms of speed, using the Scheduled Query approach wins because you have your data already de-duplicated and cleaned. If you are going to feed dashboards with this data, the View approach can lead to laziness in the dashboard loading.
  4. Another possible approach for you is using Materialized Views, which are smarter Views that periodically cache results in order to improve performance. In this guide you can find some information about choosing between Scheduled Queries and Materialized Views:

When should I use scheduled queries versus materialized views?

Scheduled queries are a convenient way to run arbitrarily complex calculations periodically. Each time the query runs, it is being run fully. The previous results are not used, and you pay the full price for the query. Scheduled queries are great when you don't need the freshest data and you have a high tolerance for data staleness.

Materialized views are suited for when you need to query the latest data while cutting down latency and cost by reusing the previously computed result. You can use materialized views as pseudo-indexes, accelerating queries to the base table without updating any existing workflows.

As a general guideline, whenever possible and if you are not running arbitrarily complex calculations, use materialized views.

rmesteves
  • 3,870
  • 7
  • 23
  • I totally missed Materialized Views. They are indeed a tradeoff hybrid. I, however, experimented with some clustering and partitioning while testing scheduled queries that store cleaned data, and they speeded up the charts in DataStudio quite dramatically. Will look in to Materialized Views. – nJGL Jul 23 '20 at 09:57
  • Actually. Material views 1) don't allow sorting, 2) don't allow DISTINCT selections, and 3) don't support JOINs, so I wonder if it is even possible to de-duplucate the stitch-data in a valid query. :( – nJGL Jul 28 '20 at 09:54
0

I think it might also be affected by how often the your view / table would be queried. For example - a very complex query over a large dataset will be costly every time it's run. If the result is a significantly smaller dataset, it will be more cost-effective to schedule a query to save the results, and query the results directly - rather than using a view which will perform the very complex query time and time again.

For the speed factor - it definitely is better to query a reduced table directly and not a view. For the cost factor - I would try to understand how often this view/table will be queried and how are the processing+storage costs for it:

  1. For a view: roughly calculate the processing costs * amount of times it will be queried monthly, for example
  2. For a stored table: scheduled queries performed per month * processing costs + monthly storage costs for the table results

This should give you pretty much the entire case you need to build in order to argue for your solution.