I'm using Cloud Composer composer-1.10.6-airflow-1.10.6.

There seems to be a problem with Airflow starting tasks and then losing track of them. The logs stop, and eventually the task gets marked as failed, even though it actually completed successfully. This is a big problem for non-idempotent tasks (like tasks that append data) when retries are configured, since a retry would re-run work that has already happened. Whenever this occurs I have to manually investigate whether the task really did complete or not, and mark the task instance accordingly.
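For illustration, here's a minimal sketch of what I mean by "retries configured" (the DAG id, schedule and task below are made up, not my actual DAG, which launches a Dataflow template):

# Illustrative only, not my actual DAG: retries are set through default_args,
# which is what makes a falsely "failed" append task dangerous.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

default_args = {
    "start_date": datetime(2020, 8, 1),
    "retries": 2,  # a spurious failure would trigger up to 2 re-runs
}

with DAG(
    dag_id="builder_example",              # made-up id
    schedule_interval="30 10 * * *",
    default_args=default_args,
) as dag:
    append_task = BashOperator(
        task_id="append_data",
        bash_command="echo 'append rows'",  # stands in for a non-idempotent append step
    )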

Here's an example log, typical of this problem. There's not much useful info in it; the logs basically just look like they end prematurely (hence it feels like Airflow has lost track of the job), while the job itself still completes successfully.

*** Reading remote log from gs://bucket/log/path/log.log
[2020-08-31 12:16:33,450] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: builder.launch_loader_prd 2020-08-30T10:30:00+00:00 [queued]>
[2020-08-31 12:16:33,569] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: builder.launch_loader_prd 2020-08-30T10:30:00+00:00 [queued]>
[2020-08-31 12:16:33,571] {taskinstance.py:841} INFO -
--------------------------------------------------------------------------------
[2020-08-31 12:16:33,572] {taskinstance.py:842} INFO - Starting attempt 1 of 1
[2020-08-31 12:16:33,572] {taskinstance.py:843} INFO -
--------------------------------------------------------------------------------
[2020-08-31 12:16:33,605] {taskinstance.py:862} INFO - Executing <Task(DataflowTemplateOperator): launch_loader_prd> on 2020-08-30T10:30:00+00:00
[2020-08-31 12:16:33,608] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'builder', 'launch_loader_prd', '2020-08-30T10:30:00+00:00', '--job_id', '449104', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mydag.py', '--cfg_path', '/tmp/tmpmzlheavp']

Any help is appreciated. Thanks.

Hippocrates

1 Answer


The log you shared looks like the expected log of a task starting. If the task fails and is marked as failed after some time, I'd check whether there are logs similar to the following:

Executor reports task instance <TaskInstance: YOUR_TASK 2020-08-31 10:30:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?.

Check for this kind of log in Stackdriver to get more information on why the task was marked as failed.

Now, regarding the issue itself: since you're executing a DataflowTemplateOperator, what might be happening is that either the task is unable to get the job status back from Dataflow because of some issue (see the example) and it fails, or the task eventually times out while waiting for the Dataflow job to finish and is therefore marked as failed. If it's the latter, you will see a "Timeout" message in the logs, and the solution is to increase the timeout for the task's execution, as sketched below.
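A minimal sketch of what that could look like, assuming Airflow 1.10.x import paths and a placeholder template path (adjust the ids, schedule and timeout to your DAG):

# Sketch only: give the operator more time to poll the Dataflow job,
# and disable retries so a false "failed" never re-runs the load.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

with DAG(
    dag_id="builder",
    start_date=datetime(2020, 8, 1),
    schedule_interval="30 10 * * *",
) as dag:
    launch_loader_prd = DataflowTemplateOperator(
        task_id="launch_loader_prd",
        template="gs://YOUR_BUCKET/templates/loader",  # placeholder template path
        execution_timeout=timedelta(hours=2),          # raise the per-task timeout
        retries=0,                                     # avoid re-running a non-idempotent load
    )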

In both cases, my suggestion is to look for the logs directly in Stackdriver to get more information about your task.

If the issue is not only related to the DataflowTemplateOperator but to any task, then I would check the overall status of the cluster to see if there are any errors with pod evictions or similar issues.

Tlaquetzal