I have a long-running task that loops through calls to a REST endpoint to fetch data, potentially hundreds of times, and can take up to 1 hour.
While the task is still running, I see multiple attempts or retries, even though I explicitly set retries=0 on the PythonOperator that calls this function. The default_args I pass into the DAG also include "retries": 0.
So I'm not sure what is retrying the task while it is still running, and there is no error whatsoever in the logs.
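For reference, here is a minimal sketch of how the DAG and operator are set up (the DAG id, task id, callable, and endpoint are placeholders, not my actual code):

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 0,  # no retries via default_args
}

def fetch_data():
    # Placeholder for the real loop: hundreds of REST calls, up to ~1 hour total
    for page in range(300):
        requests.get("https://example.com/api/data", params={"page": page}, timeout=60)

with DAG(
    dag_id="long_rest_pull",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="fetch_data",
        python_callable=fetch_data,
        retries=0,  # explicitly no retries on the operator either
    )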
In the airflow.cfg file I have changed the following settings:
job_heartbeat_sec = 3600
scheduler_heartbeat_sec = 3600
scheduler_zombie_task_threshold = 3600
task_timeout=3600
But the symptom persists.
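To rule out a stale or shadowed config file, this is the kind of check I can run against the running installation (a quick sketch; it assumes the heartbeat/zombie settings live under the [scheduler] section):

from airflow.configuration import conf

# Print the values Airflow actually resolved, in case another
# airflow.cfg or an environment variable is overriding mine.
for key in ("job_heartbeat_sec", "scheduler_heartbeat_sec", "scheduler_zombie_task_threshold"):
    print(key, "=", conf.get("scheduler", key))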
What else could be doing this?
The time between the retries as shown in the logs:
[2023-02-24, 02:32:12 UTC] {{taskinstance.py:1363}} INFO - Starting attempt 1 of 1
[2023-02-24, 02:34:30 UTC] {{taskinstance.py:1363}} INFO - Starting attempt 2 of 1
[2023-02-24, 02:39:31 UTC] {{taskinstance.py:1363}} INFO - Starting attempt 3 of 1
The gap between attempts 2 and 3 (and, when it retries more than 3 times, between 3 and 4, 4 and 5, etc.) is exactly 5 minutes. I searched for "300" (seconds) in my airflow.cfg and only found dag_dir_list_interval set to 300, which doesn't seem related. Also, I'm not sure why the gap between attempts 1 and 2 is a seemingly arbitrary 2 minutes and 18 seconds (i.e. not 5 minutes).