I want to use Airflow for image processing.

I have 4 tasks: image pre-processing (A), bounding box finder (B), classification (C), and image finalization (D).

The chart looks like this:

A -> B1 -> C  \
  -> B2 -> C  -   D
  -> B3 -> C  /
  -> Bn -> C /

The output of the image pre-processing task is a list of bounding box proposals. For each bounding box I run classification, and once all classification tasks end I run image finalization.

I want everything to run in parallel.

This will run on 10,000 images per day, so if the UI shows a different pipeline layout for each image, I won't be able to keep track of the pipeline...

Is this possible in Airflow?

asaf
I have tried scaling operators inside a task, but I have not tried scaling tasks. If you don't mind the visualization, you can try instantiating operators based on the pre-task result. Task results can be passed via XCom at the task level or via a Variable at the Airflow level. – Yong Wang May 06 '19 at 14:17

1 Answer


Creating tasks dynamically like this is not what Airflow does best. Take a look at the answer here for some insight: Airflow dynamic tasks at runtime. Airflow is better suited to scheduling, so I suggest delegating the actual work and parallelization to another tool such as Celery. You can still use Airflow to schedule the work: make your B step a simple operator that reads the output of A (via XCom or similar) and distributes the actual work to remote workers.
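As a rough sketch of that single fan-out step, the callable below reads A's proposals through `xcom_pull` and classifies them in parallel. Everything named here is an assumption for illustration: the `preprocess` task id, the `classify_box` placeholder, and the `(x, y, w, h)` box format are hypothetical, and a local thread pool stands in for remote workers such as a Celery group. `FakeTaskInstance` exists only so the sketch runs without Airflow installed.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_box(box):
    """Hypothetical placeholder for the real classifier (step C)."""
    x, y, w, h = box
    return {"box": box, "label": "large" if w * h >= 100 else "small"}


def classify_all_boxes(ti, max_workers=8):
    """Body of one fan-out task: read A's proposals, classify them in parallel.

    `ti` only needs an `xcom_pull(task_ids=...)` method, which Airflow's
    TaskInstance provides. "preprocess" is an assumed task id for step A.
    """
    boxes = ti.xcom_pull(task_ids="preprocess") or []
    # Local stand-in for remote workers; the real work could be sent elsewhere.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input boxes.
        return list(pool.map(classify_box, boxes))


class FakeTaskInstance:
    """Minimal stand-in so the sketch can be tried without Airflow."""

    def __init__(self, xcoms):
        self._xcoms = xcoms

    def xcom_pull(self, task_ids):
        return self._xcoms.get(task_ids)
```

Calling `classify_all_boxes(FakeTaskInstance({"preprocess": [(0, 0, 10, 10)]}))` exercises it locally. Because the fan-out happens inside one task, the DAG keeps the same fixed shape (A, B/C, D) in the UI for every image.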

Do you know in advance the maximum possible number of B tasks? If that number is manageable, you could create the maximum number of B tasks up front and then skip some of them as needed, depending on the outcome of A. The implementation might not be trivial, but you could get some hints from this discussion: Launch a subdag with variable parallel tasks in airflow.
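The skip idea could look roughly like the sketch below, which is not a full DAG. `MAX_BOXES`, `classify_slot`, and `run_all_slots` are hypothetical names, and the "classification" is a placeholder. Each pre-created task would call `classify_slot` with its own slot index and raise `AirflowSkipException` when A produced fewer boxes than that index. The fallback class only lets the sketch run where Airflow is not installed.

```python
try:
    from airflow.exceptions import AirflowSkipException
except ImportError:  # lets the sketch run without Airflow installed
    class AirflowSkipException(Exception):
        pass

MAX_BOXES = 4  # assumed upper bound on proposals per image


def classify_slot(slot, boxes):
    """Callable for the pre-created task at position `slot` (0-based)."""
    if slot >= len(boxes):
        # No proposal for this slot on this image: mark the task as skipped.
        raise AirflowSkipException(f"slot {slot}: no box for this image")
    x, y, w, h = boxes[slot]
    return {"slot": slot, "area": w * h}  # hypothetical "classification"


def run_all_slots(boxes):
    """Local simulation of one run: which slots ran and which were skipped."""
    ran, skipped = [], []
    for slot in range(MAX_BOXES):
        try:
            ran.append(classify_slot(slot, boxes))
        except AirflowSkipException:
            skipped.append(slot)
    return ran, skipped
```

With two boxes and `MAX_BOXES = 4`, slots 0 and 1 run and slots 2 and 3 are skipped. One thing to check in a real DAG: with the default trigger rule, skipped upstream tasks would likely cause D to be skipped as well, so D probably needs a trigger rule such as `none_failed`.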

bosnjak