I have a series of Airflow tasks that require some fan-out and collection of results for bulk processing, and I'm having a hard time visualizing how it should work.
Roughly, I fetch a list of files, process them individually through a series of transformation tasks, then load them into a database.
Fetch task(s) overview
- Fetch list of JSON files to download
- Download each JSON file
- For each file, begin the "processing workflow" (see the sketch after this list)
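The fan-out for the fetch stage is the part I can at least half-picture. Assuming a reasonably recent Airflow (2.4+ with dynamic task mapping; all function and file names below are made up), I imagine something like this, where the download task is expanded over whatever the listing task returns:

```python
# Rough sketch only -- assumes Airflow 2.4+ dynamic task mapping; names are hypothetical.
import pendulum
from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def fetch_and_process():

    @task
    def list_json_files() -> list[str]:
        # Fetch the list of JSON files to download (stubbed).
        return ["files/a.json", "files/b.json"]

    @task
    def download(remote_path: str) -> str:
        # Download one JSON file and return its local path (stubbed).
        return f"/tmp/{remote_path.split('/')[-1]}"

    # One mapped "download" task instance is created per file in the list.
    downloaded = download.expand(remote_path=list_json_files())


fetch_and_process()
```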
"Processing workflow" task(s) overview
1. Parse the JSON file
2. Reshape the JSON data
3. Run a suite of (stateless) error-correction functions on the reshaped data
4. Insert the JSON data into the database
5. Run a suite of DB-level functions on the just-inserted data
6. Run more DB-level functions on the data from step 5 (sketched below)
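Continuing the sketch above (again, hypothetical names; the `.expand()` wiring lines would live inside the same DAG function that produced `downloaded`), I picture each processing step as its own mapped task, so the parse → reshape → correct → insert chain runs once per downloaded file:

```python
# Hypothetical continuation of the earlier sketch; wiring goes inside the same @dag function.
from airflow.decorators import task


@task
def parse(local_path: str) -> dict:
    # Parse one downloaded JSON file (stubbed).
    return {"path": local_path}


@task
def reshape(record: dict) -> dict:
    # Reshape the parsed data (stubbed).
    return record


@task
def correct_errors(record: dict) -> dict:
    # Run the stateless error-correction functions (stubbed).
    return record


@task
def insert_into_db(record: dict) -> str:
    # Insert one record and return a key for the DB-level follow-up steps (stubbed).
    return "row-id"


# Each .expand() creates one mapped task instance per element of the upstream result,
# so every file flows through its own parse -> reshape -> correct -> insert chain.
parsed = parse.expand(local_path=downloaded)
reshaped = reshape.expand(record=parsed)
corrected = correct_errors.expand(record=reshaped)
inserted = insert_into_db.expand(record=corrected)
```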
It's unclear to me how, for example, to begin all of the "processing workflow" tasks for each downloaded file from within a single task. Should each bulk unit of work like this be a sub-DAG of tasks? How would you model this?
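For concreteness, the pattern I keep falling back to is a single task that loops over every file and does all the steps inline (helper names made up), which I suspect is wrong because it hides the per-file fan-out from Airflow entirely:

```python
# What I keep writing instead -- one big task, no per-file retries, logs, or graph visibility.
from airflow.decorators import task


def _parse(path: str) -> dict:          # plain helpers, invisible to the scheduler
    return {"path": path}


def _reshape(record: dict) -> dict:
    return record


def _correct(record: dict) -> dict:
    return record


def _insert(record: dict) -> None:
    pass


@task
def process_all_files(local_paths: list[str]) -> None:
    # Every file is processed inside this single task instance.
    for path in local_paths:
        _insert(_correct(_reshape(_parse(path))))
```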