
I'm trying to model my ETL jobs with Airflow. All of the jobs have roughly the same structure:

  1. Extract from a transactional database (N extractions, each one reading 1/N of the table)
  2. Then transform the data
  3. Finally, insert the data into an analytic database

So E >> T >> L
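
Roughly, I picture a single job looking like this (just a sketch, assuming Airflow 2.x; the callables are placeholders, and I've wrapped the job in its own DAG only to illustrate the task structure):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    N = 4  # number of parallel extraction slices


    def extract_chunk(chunk_id):
        """Placeholder: read slice `chunk_id` (1/N of the table) from the transactional DB."""


    def transform():
        """Placeholder: transform the extracted data."""


    def load():
        """Placeholder: insert the transformed data into the analytic DB."""


    with DAG(
        dag_id="user_etl",
        schedule_interval="0 */2 * * *",  # every 2 hours
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        extracts = [
            PythonOperator(
                task_id=f"extract_{i}",
                python_callable=extract_chunk,
                op_kwargs={"chunk_id": i},
            )
            for i in range(N)
        ]
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extracts >> transform_task
        transform_task >> load_task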

This Company Routine (USER >> PRODUCT >> ORDER) has to run every 2 hours; after it runs I have all the data from users and purchases.

How can I model it?

  • Should the Company Routine (USER >> PRODUCT >> ORDER) be a single DAG with each job as a separate Task? In that case, how can I model each step (E, T, L) inside a task and make them behave like "sub-tasks" in Airflow?
  • Or should each job be its own DAG? In that case, how do I express that the Company Routine (USER >> PRODUCT >> ORDER) has to run every 2 hours with these dependencies between the jobs? As far as I can see, a cron schedule and dependencies can only be set between tasks inside a single DAG.

[Diagram omitted.]

Right now I'm modelling each Company Routine (USER >> PRODUCT >> ORDER) as a DAG, with each job as a separate Task.
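
Roughly like this (again just a sketch with placeholder callables; each job's E, T and L steps end up hidden inside a single callable):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def run_user_job():
        """Placeholder: extract, transform and load for USER, all in one function."""


    def run_product_job():
        """Placeholder: extract, transform and load for PRODUCT, all in one function."""


    def run_order_job():
        """Placeholder: extract, transform and load for ORDER, all in one function."""


    with DAG(
        dag_id="company_routine",
        schedule_interval="0 */2 * * *",  # every 2 hours
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        user = PythonOperator(task_id="user", python_callable=run_user_job)
        product = PythonOperator(task_id="product", python_callable=run_product_job)
        order = PythonOperator(task_id="order", python_callable=run_order_job)

        user >> product >> order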


1 Answer


The 2nd option is better (have each sub-workflow of the Company Routine as a top-level DAG), because:

  • top-level DAGs can be re-run independently (in case just one of them needs to be rerun), whereas you cannot rerun just a part of a monolithic DAG
  • the same holds true for backfilling

But then you must also link up those top-level DAGs (so that they run one after another). For that, see Wiring top-level DAGs together.
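
For instance, the push-based variant with a TriggerDagRunOperator looks roughly like this (a sketch, assuming Airflow 2.x; DAG and task ids are placeholders). Only the first DAG in the chain carries the 2-hourly schedule; product_etl and order_etl are defined the same way but with schedule_interval=None, since each is started by the trigger task of the DAG before it:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
        dag_id="user_etl",
        schedule_interval="0 */2 * * *",  # every 2 hours; downstream DAGs use schedule_interval=None
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # stand-in for this job's extract >> transform >> load tasks
        load_task = PythonOperator(task_id="load", python_callable=lambda: None)

        # the last task of user_etl kicks off the next top-level DAG in the chain
        trigger_product = TriggerDagRunOperator(
            task_id="trigger_product_etl",
            trigger_dag_id="product_etl",
        )

        load_task >> trigger_product

The linked answer also covers the pull-based alternative, where the downstream DAG starts with an ExternalTaskSensor waiting on the upstream DAG, if you'd rather not have an upstream DAG know about its successors.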
