
I'm in the process of learning the ins and outs of Airflow to end all our Cron woes. While trying to mimic the failure of (CeleryExecutor) workers, I got stuck with Sensors. I'm using ExternalTaskSensors to wire up top-level DAGs together as described here.

My current understanding is that since a Sensor is just a type of Operator, it must inherit its basic traits from BaseOperator. If I kill a worker (the docker container), all ordinary (non-Sensor) tasks running on it get rescheduled on other workers.
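To make that mental model concrete, here is a minimal, stdlib-only sketch of how I understand a Sensor to relate to an ordinary Operator. It is not Airflow's actual code: `BaseOperatorModel`, `SensorModel` and `SensorTimeout` are hypothetical stand-ins, and the defaults (60 s poke interval, 7-day timeout) are what I believe Airflow 1.9 uses. The point it illustrates is that, as far as I can tell, the whole wait happens inside a single long-running `execute()` call on one worker.

```python
import time


class SensorTimeout(Exception):
    """Hypothetical stand-in for the timeout error a sensor raises."""


class BaseOperatorModel:
    """Toy stand-in for BaseOperator: holds common settings such as retries."""

    def __init__(self, task_id, retries=0):
        self.task_id = task_id
        self.retries = retries

    def execute(self, context):
        raise NotImplementedError


class SensorModel(BaseOperatorModel):
    """Toy sensor: an operator whose execute() polls a condition in a loop."""

    def __init__(self, task_id, poke, poke_interval=60,
                 timeout=60 * 60 * 24 * 7, retries=0):
        super().__init__(task_id, retries=retries)
        self.poke = poke                    # callable(context) -> bool
        self.poke_interval = poke_interval  # seconds to sleep between checks
        self.timeout = timeout              # seconds before giving up

    def execute(self, context):
        started_at = time.monotonic()
        # The process running this loop occupies a worker slot for the
        # entire wait; if that worker dies, the loop dies with it.
        while not self.poke(context):
            if time.monotonic() - started_at > self.timeout:
                raise SensorTimeout("%s timed out" % self.task_id)
            time.sleep(self.poke_interval)
```

If this model is right, a sensor differs from other tasks only in how long it holds its worker slot, so I would expect the same rescheduling behaviour for both when a worker is killed.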


However, upon killing a worker, the ExternalTaskSensor does not get rescheduled on a different worker; rather, it gets stuck.

Logs of ExternalTaskSensor after worker was killed

Then one of the following happens:

  • I keep waiting for several minutes, and sometimes the ExternalTaskSensor is then marked as failed but the workflow resumes (this has happened a few times, but I don't have a screenshot)
  • I stop all docker containers (including those running the scheduler, celery, etc.) and restart them all; the stuck ExternalTaskSensor then gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers to get the stuck ExternalTaskSensor running again

Sensor still stuck after single docker container stop-start cycle

Sensor resumes after several docker container stop-start cycles


My questions are:

  • Does docker have a role in this weird behaviour?
  • Is there a difference between Sensors (particularly ExternalTaskSensor) and other operators in terms of scheduling / retry behaviour?
  • How can I ensure that a Sensor is also rescheduled when the worker it is running on gets killed?

I'm using puckel/docker-airflow with

  • Airflow 1.9.0-4
  • Python 3.6-slim
  • CeleryExecutor with redis:3.2.7

This is the link to my code.

y2k-shubham
  • All `operator`s in my *workflow* have `trigger_rule = 'all_done'`, so the *workflow* is able to resume even after the `ExternalTaskSensor`s are marked *failed* – y2k-shubham Jul 21 '18 at 11:45
