I'm in the process of learning the ins and outs of Airflow to end all our cron woes. While trying to mimic the failure of (`CeleryExecutor`) workers, I got stuck with sensors. I'm using `ExternalTaskSensor`s to wire up top-level DAGs together, as described here.
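For reference, this is roughly the wiring pattern I'm following; the DAG and task ids below are placeholders (my actual code is linked at the end), and the import paths are the Airflow 1.9 ones:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor

# Downstream top-level DAG that waits for a task in an upstream top-level DAG.
# Both DAGs share the same schedule_interval so the sensor can match
# execution dates one-to-one.
with DAG(dag_id='child_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    wait_for_parent = ExternalTaskSensor(
        task_id='wait_for_parent',
        external_dag_id='parent_dag',   # placeholder upstream DAG id
        external_task_id='final_task',  # placeholder upstream task id
        poke_interval=30,               # seconds between pokes
    )

    do_work = DummyOperator(task_id='do_work')

    wait_for_parent >> do_work
```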
My current understanding is that since a `Sensor` is just a type of `Operator`, it must inherit basic traits from `BaseOperator`. If I kill a worker (the docker container), all ordinary (non-`Sensor`) tasks running on it get rescheduled on other workers.
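For what it's worth, this is the inheritance chain that assumption is based on (Airflow 1.9 import paths; just a sanity-check snippet, not part of my DAGs):

```python
from airflow.models import BaseOperator
from airflow.operators.sensors import BaseSensorOperator, ExternalTaskSensor

# In Airflow 1.9, ExternalTaskSensor -> BaseSensorOperator -> BaseOperator,
# so I assumed sensors accept the usual BaseOperator arguments (retries,
# retry_delay, etc.) and behave the same way on worker failure.
assert issubclass(ExternalTaskSensor, BaseSensorOperator)
assert issubclass(BaseSensorOperator, BaseOperator)
```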
However, upon killing a worker, the `ExternalTaskSensor` does not get rescheduled on a different worker; rather, it gets stuck. Then either of the following things happens:
- I just keep waiting for several minutes, and then sometimes the `ExternalTaskSensor` is marked as failed but the workflow resumes (this has happened a few times but I don't have a screenshot)
- I stop all docker containers (including those running the `scheduler` / `celery` etc.) and then restart them all; the stuck `ExternalTaskSensor` then gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers before the stuck `ExternalTaskSensor` resumes again
*Sensor still stuck after a single docker container stop-start cycle*

*Sensor resumes after several docker container stop-start cycles*
My questions are:

- Does docker have a role in this weird behaviour?
- Is there a difference between `Sensor`s (particularly `ExternalTaskSensor`) and other operators in terms of scheduling / retry behaviour?
- How can I ensure that a `Sensor` is also rescheduled when the worker it is running on gets killed? (The sketch after this list shows the kind of retry settings I assumed would apply.)
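Regarding that last question, below is the kind of retry/timeout configuration I assumed would let a stuck sensor fail and be retried (possibly on another worker). The argument names come from `BaseOperator` / `BaseSensorOperator`; the ids and values are placeholders, not my actual settings:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor

with DAG(dag_id='child_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    wait_for_parent = ExternalTaskSensor(
        task_id='wait_for_parent',
        external_dag_id='parent_dag',
        external_task_id='final_task',
        poke_interval=30,                  # seconds between pokes
        timeout=60 * 60,                   # give up poking after an hour
        retries=3,                         # BaseOperator retry settings
        retry_delay=timedelta(minutes=5),
    )
```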
I'm using puckel/docker-airflow with

- Airflow 1.9.0-4
- Python 3.6-slim
- `CeleryExecutor` with `redis:3.2.7`
This is the link to my code.