2

I set up the DAG from the tutorial at https://airflow.apache.org/tutorial.html as is; the only change is that the DAG runs at an interval of 5 minutes with a start date of 2017-12-17T13:40:00 UTC. I enabled the DAG before 13:40, so there was no backfill, and my machine runs on UTC. The DAG ran as expected (i.e., at 5-minute intervals starting at 13:45 UTC).

Now, when I go to the tree view, I fail to understand the graph. There are 3 tasks in total. 'sleep' (t2) has its upstream set to 'print_date' (t1), and 'templated' (t3) also has its upstream set to 'print_date' (t1). Then why does the graph show two 'print_date's? Are they separate task instances of that task? If so, how do I make sure that only one task instance of t1 runs (diamond pattern)? There are also 4 green rectangular boxes (with two 'print_date's) instead of 3.

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)

Second, why do the times above the DAG runs (green circles) show 8:40, 8:45, and so on? What time or timezone is that? I set start_date for the DAG to 13:40 and my machine is set to UTC.


soupybionics
  • Those who think `sleep` & `templated` should have *branched out* of *single* `print_date` task (in **Tree-View**) should read [this](https://stackoverflow.com/questions/45362880/) – y2k-shubham Jul 06 '18 at 05:37

2 Answers

4

They are not separate instances. You can see this:

  1. In Tree View, the start/end dates and duration of both circles will be exactly the same.

  2. In Gantt view, you will see the duration for only a single instance of print_date.

In general, you can't map a DAG to a tree view without duplicating nodes, which is what the tree view does here: a task shared by several downstream tasks is drawn once under each of them.
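To illustrate the duplication, here is a minimal sketch in plain Python. It is not Airflow's rendering code; the `UPSTREAM` map and the `unroll`/`tree_rows` helpers are hypothetical names made up for this example. It unrolls the three-task DAG from the question into tree rows, starting from the tasks that nothing depends on and walking upstream.

```python
from typing import Dict, List

# Hypothetical adjacency map: each task lists the tasks directly upstream of it.
UPSTREAM: Dict[str, List[str]] = {
    "print_date": [],
    "sleep": ["print_date"],
    "templated": ["print_date"],
}


def unroll(task: str, upstream: Dict[str, List[str]], depth: int = 0) -> List[str]:
    """Return an indented row for `task`, followed by rows for its upstream tasks."""
    rows = ["  " * depth + task]
    for parent in upstream[task]:
        rows.extend(unroll(parent, upstream, depth + 1))
    return rows


def tree_rows(upstream: Dict[str, List[str]]) -> List[str]:
    """Unroll the DAG into tree rows, starting from tasks with no downstream."""
    has_downstream = {p for parents in upstream.values() for p in parents}
    rows: List[str] = []
    for task in upstream:
        if task not in has_downstream:
            rows.extend(unroll(task, upstream))
    return rows


rows = tree_rows(UPSTREAM)
print("\n".join(rows))
```

Under these assumptions the tree has four rows for three tasks, with `print_date` listed once under `sleep` and once under `templated`. Both rows refer to the same task, which is why their start/end times match.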

Dmitriy
dgn
-3

1. Yes, they are separate tasks. To make sure that there is only one print_date, you can do:

t1 >> t2 >> t3

instead of

t2.set_upstream(t1)
t3.set_upstream(t1) 

You can change the order as per your workflow.

2. On my machine, those green dots display the time of the scheduled run in UTC. Are you sure that's not in your database timezone?
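One quick way to sanity-check the timezone theory: the labels in the question (8:40, 8:45) are exactly five hours behind 13:40, 13:45 UTC. The sketch below uses only the standard library and assumes a fixed UTC-5 display offset. That offset is a guess that fits the numbers, not something confirmed in this thread.

```python
from datetime import datetime, timedelta, timezone

# Start date from the question, as a timezone-aware UTC datetime.
start_utc = datetime(2017, 12, 17, 13, 40, tzinfo=timezone.utc)

# Hypothetical display zone: a fixed UTC-5 offset, chosen only because it
# matches the 5-hour gap between 13:40 and 8:40.
display_zone = timezone(timedelta(hours=-5))

# Same instant, rendered in the assumed display zone.
shown = start_utc.astimezone(display_zone)
offset_hours = shown.utcoffset().total_seconds() / 3600

print(shown.strftime("%H:%M"), offset_hours)
```

If the labels in your UI match the converted value, the times are most likely being rendered in a non-UTC zone somewhere (browser, web server, or database) rather than the runs being scheduled at the wrong time.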

x97Core