I use a BigQuery dataset as a data lake to store all record/event-level data, and a SQL Server instance to store aggregated reports that are updated regularly. Because the reports will be accessed frequently by clients via a web interface, and each report aggregates a large amount of data, serving them straight from BigQuery is a no-go.
What is the best practice for doing this? Internally we have two ideas floating around:
Run a Dataflow batch job every X hours to recalculate the aggregations and update SQL Server. It needs a scheduler to trigger it, but the same job can be used to backfill all data (see the Beam sketch after this list).
Run an Airflow job that does the same thing. A separate job would be needed for backfill, but it could still share most of the code with the regular job (see the DAG sketch at the end of the post).
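For the Dataflow option, here is a minimal Apache Beam sketch of what I have in mind, assuming a hypothetical `lake.events` table, a per-user sum as the "report", and placeholder project/connection details. Note that Beam's Python JDBC sink is a cross-language transform, so the SQL Server JDBC driver jar has to be made available to the expansion service.

```python
import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Schema of the aggregated rows written to SQL Server (hypothetical report shape).
ReportRow = typing.NamedTuple("ReportRow", [("user_id", str), ("total", float)])
coders.registry.register_coder(ReportRow, coders.RowCoder)


def run():
    opts = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder; also used by the BQ read
    )
    with beam.Pipeline(options=opts) as p:
        (
            p
            # Pull only the slice of the lake needed for this run.
            | "ReadEvents" >> beam.io.ReadFromBigQuery(
                query="""SELECT user_id, amount
                         FROM `my-project.lake.events`
                         WHERE event_date = CURRENT_DATE()""",
                use_standard_sql=True,
            )
            | "KeyByUser" >> beam.Map(lambda r: (r["user_id"], r["amount"]))
            | "SumPerUser" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(
                lambda kv: ReportRow(user_id=kv[0], total=float(kv[1]))
            ).with_output_types(ReportRow)
            # Cross-language JDBC sink; the mssql driver jar must be on the
            # expansion service classpath.
            | "WriteReport" >> WriteToJdbc(
                table_name="daily_report",
                driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                jdbc_url="jdbc:sqlserver://sql-host:1433;databaseName=reports",
                username="report_writer",
                password="secret",  # placeholder; use a secret manager in practice
            )
        )


if __name__ == "__main__":
    run()
```

Since the job recalculates aggregates, in practice we would probably write into a staging table and swap it in, so web clients never read a half-written report.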
I know Dataflow does well at processing chunks of data in parallel, but I wonder about Airflow's performance for a job like this, as well as the risk of exhausting SQL Server's connection limit.
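For the Airflow option, a sketch of the scheduled DAG is below. It pushes the aggregation down to BigQuery so Airflow only moves the (small) result set, and it uses an Airflow pool to cap concurrent tasks hitting SQL Server, which is my current thinking on the connection-limit worry. Connection IDs, table names, and the pool name are all placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook


def refresh_report(ds, **_):
    # Let BigQuery do the heavy aggregation; only aggregated rows cross the wire.
    bq = BigQueryHook(gcp_conn_id="bigquery_default", use_legacy_sql=False)
    rows = bq.get_records(
        f"""
        SELECT user_id, SUM(amount) AS total
        FROM `my-project.lake.events`
        WHERE event_date = DATE('{ds}')
        GROUP BY user_id
        """
    )

    # Rewrite the day's slice of the report: delete, then bulk insert in batches.
    mssql = MsSqlHook(mssql_conn_id="reports_db")
    mssql.run("DELETE FROM daily_report WHERE report_date = %s", parameters=(ds,))
    mssql.insert_rows(
        table="daily_report",
        rows=[(ds, r[0], r[1]) for r in rows],
        target_fields=["report_date", "user_id", "total"],
        commit_every=5000,
    )


with DAG(
    dag_id="refresh_sql_reports",
    start_date=datetime(2024, 1, 1),
    schedule="0 */6 * * *",  # every 6 hours; adjust to your X
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="refresh_report",
        python_callable=refresh_report,
        # The pool caps how many of these tasks run at once, so a backfill
        # can't exhaust the SQL Server connection limit.
        pool="mssql_pool",
    )
```

One note on backfill: running `airflow dags backfill` (or flipping `catchup=True`) replays past intervals with the same callable, which is one way the backfill path can share most of its code with the regular job.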