I use a BigQuery dataset as a data lake to store all record/event-level data, and a SQL Server instance to store aggregated reports that are updated regularly. Because the reports will be accessed frequently by clients via a web interface, and each report aggregates a large amount of data, serving them straight from BigQuery is a no-go.
What is the best practice for doing this? Internally we have two ideas floating around:
Run a Dataflow batch job every X hours to recalculate the aggregations and update SQL Server. It needs a scheduler to trigger it, but the same job can be used to backfill all data (see the Beam sketch after this list).
Run an Airflow job that does the same thing. A separate job would be needed for backfill, but it could still share most of the code with the regular job (see the DAG sketch at the end of the post).
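For the Dataflow option, here is a minimal Apache Beam sketch of what I have in mind, assuming a hypothetical `lake.events` table, a per-user sum as the "report", and placeholder project/connection details. Note that Beam's Python JDBC sink is a cross-language transform, so the SQL Server JDBC driver jar has to be made available to the expansion service.

```python
import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Schema of the aggregated rows written to SQL Server (hypothetical report shape).
ReportRow = typing.NamedTuple("ReportRow", [("user_id", str), ("total", float)])
coders.registry.register_coder(ReportRow, coders.RowCoder)


def run():
    opts = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder; also used by the BQ read
    )
    with beam.Pipeline(options=opts) as p:
        (
            p
            # Pull only the slice of the lake needed for this run.
            | "ReadEvents" >> beam.io.ReadFromBigQuery(
                query="""SELECT user_id, amount
                         FROM `my-project.lake.events`
                         WHERE event_date = CURRENT_DATE()""",
                use_standard_sql=True,
            )
            | "KeyByUser" >> beam.Map(lambda r: (r["user_id"], r["amount"]))
            | "SumPerUser" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(
                lambda kv: ReportRow(user_id=kv[0], total=float(kv[1]))
            ).with_output_types(ReportRow)
            # Cross-language JDBC sink; the mssql driver jar must be on the
            # expansion service classpath.
            | "WriteReport" >> WriteToJdbc(
                table_name="daily_report",
                driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                jdbc_url="jdbc:sqlserver://sql-host:1433;databaseName=reports",
                username="report_writer",
                password="secret",  # placeholder; use a secret manager in practice
            )
        )


if __name__ == "__main__":
    run()
```

Since the job recalculates aggregates, in practice we would probably write into a staging table and swap it in, so web clients never read a half-written report.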
I know Dataflow does well at processing chunks of data in parallel, but I wonder about Airflow's performance for a job like this, as well as the risk of exhausting SQL Server's connection limit.
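For the Airflow option, a sketch of the scheduled DAG is below. It pushes the aggregation down to BigQuery so Airflow only moves the (small) result set, and it uses an Airflow pool to cap concurrent tasks hitting SQL Server, which is my current thinking on the connection-limit worry. Connection IDs, table names, and the pool name are all placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook


def refresh_report(ds, **_):
    # Let BigQuery do the heavy aggregation; only aggregated rows cross the wire.
    bq = BigQueryHook(gcp_conn_id="bigquery_default", use_legacy_sql=False)
    rows = bq.get_records(
        f"""
        SELECT user_id, SUM(amount) AS total
        FROM `my-project.lake.events`
        WHERE event_date = DATE('{ds}')
        GROUP BY user_id
        """
    )

    # Rewrite the day's slice of the report: delete, then bulk insert in batches.
    mssql = MsSqlHook(mssql_conn_id="reports_db")
    mssql.run("DELETE FROM daily_report WHERE report_date = %s", parameters=(ds,))
    mssql.insert_rows(
        table="daily_report",
        rows=[(ds, r[0], r[1]) for r in rows],
        target_fields=["report_date", "user_id", "total"],
        commit_every=5000,
    )


with DAG(
    dag_id="refresh_sql_reports",
    start_date=datetime(2024, 1, 1),
    schedule="0 */6 * * *",  # every 6 hours; adjust to your X
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="refresh_report",
        python_callable=refresh_report,
        # The pool caps how many of these tasks run at once, so a backfill
        # can't exhaust the SQL Server connection limit.
        pool="mssql_pool",
    )
```

One note on backfill: running `airflow dags backfill` (or flipping `catchup=True`) replays past intervals with the same callable, which is one way the backfill path can share most of its code with the regular job.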