I build statistical output generated on demand from data stored in BigQuery tables. Some of the data is imported daily via Stitch in "Append-Only" mode, which results in duplicated observations in the imported tables (around 20 million rows, growing by about 8 million per year).
I could either schedule a BigQuery query that stores deduplicated values in a cleaned table, or build views that do the same, but I don't understand the trade-offs in terms of:
- costs on BigQuery for storing/running scheduled queries and views.
- speed of later queries that depend on the deduplicated views (are view results cached?).
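To make the comparison concrete, the two variants I'm weighing would look roughly like this (a sketch only: the table names and the `id` key column are placeholders; `_sdc_sequence` is one of the metadata columns Stitch appends, used here to keep the newest copy of each row):

```sql
-- Option A: daily scheduled query that materializes a cleaned table.
-- Storage is paid twice (raw + clean) and one dedup scan runs per day;
-- dashboard queries then read only the smaller clean table.
CREATE OR REPLACE TABLE `my-project.my_dataset.observations_clean` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id              -- the record's natural key (placeholder)
      ORDER BY _sdc_sequence DESC  -- Stitch metadata: newest copy first
    ) AS rn
  FROM `my-project.my_dataset.observations_raw`
)
WHERE rn = 1;

-- Option B: a plain (non-materialized) view with the same logic.
-- No extra storage, but every query against the view re-runs the
-- deduplication over the full raw table.
CREATE OR REPLACE VIEW `my-project.my_dataset.observations_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY _sdc_sequence DESC) AS rn
  FROM `my-project.my_dataset.observations_raw`
)
WHERE rn = 1;
```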
Am I correct to assume that a daily scheduled query storing deduplicated data is more costly (because it rewrites the stored table every day), but speeds up later queries against the deduplicated data (saving on usage costs)?
The deduplicated data will in turn be queried hundreds of times daily to produce dashboard output, for which responsiveness is a concern.
How should I reason when deciding which solution is better?