I'm in the process of learning the ins and outs of Airflow to end all our cron woes. While trying to mimic the failure of (`CeleryExecutor`) workers, I got stuck with sensors. I'm using `ExternalTaskSensor`s to wire up top-level DAGs together, as described here.
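For reference, this is roughly the wiring pattern I'm following; the DAG and task ids below are placeholders (my actual code is linked at the end), and the import paths are the Airflow 1.9 ones:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor

# Downstream top-level DAG that waits for a task in an upstream top-level DAG.
# Both DAGs share the same schedule_interval so the sensor can match
# execution dates one-to-one.
with DAG(dag_id='child_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    wait_for_parent = ExternalTaskSensor(
        task_id='wait_for_parent',
        external_dag_id='parent_dag',   # placeholder upstream DAG id
        external_task_id='final_task',  # placeholder upstream task id
        poke_interval=30,               # seconds between pokes
    )

    do_work = DummyOperator(task_id='do_work')

    wait_for_parent >> do_work
```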
My current understanding is that since a `Sensor` is just a type of `Operator`, it must inherit basic traits from `BaseOperator`. If I kill a worker (the docker container), all ordinary (non-`Sensor`) tasks running on it get rescheduled on other workers.
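For what it's worth, this is the inheritance chain that assumption is based on (Airflow 1.9 import paths; just a sanity-check snippet, not part of my DAGs):

```python
from airflow.models import BaseOperator
from airflow.operators.sensors import BaseSensorOperator, ExternalTaskSensor

# In Airflow 1.9, ExternalTaskSensor -> BaseSensorOperator -> BaseOperator,
# so I assumed sensors accept the usual BaseOperator arguments (retries,
# retry_delay, etc.) and behave the same way on worker failure.
assert issubclass(ExternalTaskSensor, BaseSensorOperator)
assert issubclass(BaseSensorOperator, BaseOperator)
```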
However, upon killing a worker, the `ExternalTaskSensor` does not get rescheduled on a different worker; rather, it gets stuck. Then either of the following things happens:
- I just keep waiting for several minutes, and then sometimes the `ExternalTaskSensor` is marked as failed but the workflow resumes (this has happened a few times but I don't have a screenshot)
- I stop all docker containers (including those running the `scheduler` / `celery` etc.) and then restart them all; the stuck `ExternalTaskSensor` then gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers before the stuck `ExternalTaskSensor` resumes again
*Sensor still stuck after a single docker container stop-start cycle*

*Sensor resumes after several docker container stop-start cycles*
My questions are:

- Does docker have a role in this weird behaviour?
- Is there a difference between `Sensor`s (particularly `ExternalTaskSensor`) and other operators in terms of scheduling / retry behaviour?
- How can I ensure that a `Sensor` is also rescheduled when the worker it is running on gets killed? (The sketch after this list shows the kind of retry settings I assumed would apply.)
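Regarding that last question, below is the kind of retry/timeout configuration I assumed would let a stuck sensor fail and be retried (possibly on another worker). The argument names come from `BaseOperator` / `BaseSensorOperator`; the ids and values are placeholders, not my actual settings:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor

with DAG(dag_id='child_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    wait_for_parent = ExternalTaskSensor(
        task_id='wait_for_parent',
        external_dag_id='parent_dag',
        external_task_id='final_task',
        poke_interval=30,                  # seconds between pokes
        timeout=60 * 60,                   # give up poking after an hour
        retries=3,                         # BaseOperator retry settings
        retry_delay=timedelta(minutes=5),
    )
```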
I'm using puckel/docker-airflow with

- Airflow 1.9.0-4
- Python 3.6-slim
- `CeleryExecutor` with `redis:3.2.7`
This is the link to my code.