I'm building a Spark job to migrate 6 years of ecommerce orders/events into our new BigQuery warehouse, to complement the new streaming capability. The migrated data uses the same model as the new data, and views will expose parts of it as needed.
We have 4 datasets:
- data (raw data - hierarchical JSON)
- views (like DIMs over the raw data - generally flat, but not always)
- pre_aggs (e.g. currency-converting raw totals, with some key rollup dims included - flat)
- reports (as you would think)
Unlike some of the new live data, which is streamed in and unbounded, this historical migration is batch and bounded (I don't have to concern myself with late-arriving events, watermarking or duplicates, for example). I can also partition the data manually by event time (orderPlacedTimestamp) and persist it into the correct date-partitioned table (suffix?). The full data set is about 2 GB and 2 million rows compressed in BQ, so not massive, but the structures are fairly complex, with unnesting needed in the view layer. I have the option to write the raw data and the aggregates as materialised tables from Spark, so I really want to do this in line with best practice and optimise for performance (query speed is more important and worth paying a bit extra for).
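For context, this is a minimal sketch of what I mean by a single date-partitioned table, assuming the table lives in the data dataset, assuming orderPlacedTimestamp is promoted to a top-level TIMESTAMP column (BigQuery partitions on a top-level DATE/TIMESTAMP column, not a nested field), and with placeholder names for the nested payload:

CREATE TABLE `data`.order_history (
  orderPlacedTimestamp TIMESTAMP,                  -- promoted to top level so it can be the partition column
  `order` STRUCT<orderId STRING, total NUMERIC>    -- placeholder for the rest of the nested order payload
)
PARTITION BY DATE(orderPlacedTimestamp);

As I understand it, a query that filters on orderPlacedTimestamp would then only scan the matching date partitions rather than the whole table, which I think would address the full-scan worry with the single-table option below.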
I found this really good blog post on a SQL/query lambda architecture, which is some inspiration, and I will attempt something similar:
https://www.linkedin.com/pulse/building-robust-real-time-etl-google-bigquery-lambda-novozhilov
I'm still wondering, though, how best to store/partition this data and then construct time-based queries to match. Week-on-week and month-on-month reports are likely to be the most common.
My options seem to be:
- Everything in one table - seems nice and simple, with no ongoing table management over time, but doesn't it mean a full scan for every query, when often I just want to go back a year or two at most from a point in time?
- One table per time segment, e.g. yearly or monthly:
order_history_2017 or order_history_201701
We have a calendar lookup dim, and each row will have keys against which the suffix above could be used, e.g. 201701 for Jan 2017.
Monthly would mean 72 tables, which seems like quite a lot; maybe yearly is better?
For argument's sake, say it's monthly tables: what is the best way in BQ (standard SQL) to query the right set of suffixed tables covering a contiguous timeline, constructing the table names on the fly (dynamically, perhaps)?
E.g. say I want to query all orders (an order has an orderPlacedTimestamp) between 2017-01-10 and 2017-02-10 - this would mean scanning (and unioning?) only the order_history_201701 and order_history_201702 tables in this case, then doing a BETWEEN like below:
SELECT *
FROM (
  SELECT * FROM order_history_201701
  UNION ALL
  SELECT * FROM order_history_201702
)
WHERE DATE(`order`.orderPlacedTimestamp) BETWEEN DATE '2017-01-10' AND DATE '2017-02-10'
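Or perhaps the wildcard-table form is the way to do this dynamically - a sketch, assuming all the monthly shards share the order_history_ prefix and sit in the data dataset (the project name is a placeholder):

SELECT *
FROM `myproject.data.order_history_*`
WHERE _TABLE_SUFFIX BETWEEN '201701' AND '201702'   -- limits the scan to the two monthly shards
  AND DATE(`order`.orderPlacedTimestamp) BETWEEN DATE '2017-01-10' AND DATE '2017-02-10'

From what I understand, a constant filter on _TABLE_SUFFIX restricts the scan to just those shards, and presumably the '201701'/'201702' bounds could be derived with FORMAT_DATE('%Y%m', ...) from the calendar dim keys rather than hard-coded.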
I might then get scenarios where this historic data needs unioning with the 'live' (streaming) data too, wrapped up in a view like in the article on the lambda design.
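Roughly what I imagine that wrapping view looking like - a sketch with placeholder table names (order_history for the migrated batch data, orders_live for the streamed table), assuming both sides share the same schema so the UNION ALL lines up:

CREATE OR REPLACE VIEW `views`.all_orders AS
SELECT * FROM `data`.order_history   -- migrated batch history (placeholder name)
UNION ALL
SELECT * FROM `data`.orders_live;    -- live streaming table (placeholder name)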
- An option I haven’t thought of.
So many options with BQ! :)
Anyway, that is my current thinking - any words of wisdom on table design and optimised query construction would be hugely appreciated.
Thanks heaps all!