
Are there any best practices for deploying new DAGs to Airflow?

I saw a couple of comments on the Google forum stating that the DAGs are kept in a Git repository and synced periodically to a local directory on the Airflow cluster.
Regarding this approach, I have a couple of questions:

  • Do we maintain separate DAG files for separate environments (testing, production)?
  • How do we roll back an ETL to an older version if the new version has a bug?

Any help here is highly appreciated. Let me know if you need any further details.

Sreenath Kamath

3 Answers


    Here is how we manage it for our team.

First, in terms of naming convention: each of our DAG file names matches the DAG ID from the content of the DAG itself (including the DAG version). This is useful because ultimately it's the DAG ID that you see in the Airflow UI, so you will know exactly which file is behind each DAG.

    Example for a DAG like this:

    from airflow import DAG
    from datetime import datetime, timedelta

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2017, 12, 5, 23, 59),  # note: a leading zero (05) is a syntax error in Python 3
        'email': ['me@mail.com'],
        'email_on_failure': True
    }

    dag = DAG(
        'my_nice_dag-v1.0.9',  # update the version whenever you change something
        default_args=default_args,
        schedule_interval="0,15,30,45 * * * *",
        dagrun_timeout=timedelta(hours=24),
        max_active_runs=1)
    [...]  # task definitions go here
    

    The name of the DAG file would be: my_nice_dag-v1.0.9.py

    • All our DAG files are stored in a Git repository (among other things)
• Every time a merge request is merged into our master branch, our Continuous Integration pipeline starts a new build and packages our DAG files into a zip (we use Atlassian Bamboo, but there are other solutions like Jenkins, CircleCI, Travis...)
• In Bamboo we configured a deployment script (shell) which unzips the package and places the DAG files on the Airflow server in the /dags folder (a minimal sketch of this step follows the list).
• We usually deploy the DAGs in DEV for testing, then to UAT and finally PROD. The deployment is done with the click of a button in the Bamboo UI thanks to the shell script mentioned above.
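
For illustration, here is a minimal sketch of what that deployment step could look like in Python; both paths are hypothetical placeholders, and the original setup uses a Bamboo shell script instead:

    # Hypothetical deployment step: unpack the CI build artifact into
    # Airflow's dags folder. Both paths below are placeholders.
    import zipfile

    PACKAGE = "/tmp/dag-package.zip"    # zip produced by the CI build
    DAGS_FOLDER = "/opt/airflow/dags"   # Airflow's configured dags_folder

    with zipfile.ZipFile(PACKAGE) as package:
        # Versioned file names (e.g. my_nice_dag-v1.0.9.py) mean older
        # versions are not overwritten on extraction.
        package.extractall(DAGS_FOLDER)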

    Benefits

1. Because you have included the DAG version in your file name, the previous version of your DAG file is not overwritten in the DAG folder, so you can easily come back to it.
2. When your new DAG file is loaded in Airflow, you can recognize it in the UI thanks to the version number.
3. Because your DAG file name = DAG ID, you could even improve the deployment script by adding an Airflow CLI call to automatically switch ON your new DAGs once they are deployed (sketched just after this list).
4. Because every version of the DAGs is kept in Git history, we can always come back to previous versions if needed.
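
A rough sketch of benefit 3; the DAG ID comes from the example above, and `airflow unpause` is the Airflow 1.x CLI command (Airflow 2.x uses `airflow dags unpause`):

    # Sketch: switch ON the freshly deployed DAG from the deployment script.
    import subprocess

    # On Airflow 2.x the command would be: airflow dags unpause <dag_id>
    subprocess.run(["airflow", "unpause", "my_nice_dag-v1.0.9"], check=True)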
Alexis.Rolland
• Hi Alexis, thanks for the clarification. In case there are environment-specific values, say a URL in an HttpOperator, how are these handled? Do you maintain separate DAG files for each environment, or use some configuration management system? – Sreenath Kamath Jan 22 '18 at 16:43
• Hello @SreenathKamath, for environment-specific values we configure them as Airflow Variables in their respective Airflow environment. You can find them in the menu under Admin > Variables. In your DAG, you can read these variables using `from airflow.models import Variable` and then `Variable.get('my_variable_name')` (a minimal sketch follows this thread). – Alexis.Rolland Jan 23 '18 at 03:12
    • @SreenathKamath please consider marking this question as solved if the answer satisfied you. Thanks – Alexis.Rolland Feb 09 '18 at 16:12
• Hi Alexis, when you say the previous version is not overwritten, do you mean it stays in the DAG folder? – gorros Jul 26 '19 at 07:24
    • @alexis but in that case you need to switch off that old DAG since it will continue to run – gorros Jul 28 '19 at 09:29
    • @gorros regardless of whether you replace the file in the DAG folder or not, you would still have to switch off the old version and switch on the new one in the UI. You can also do it with CLI. cf. my point 3 in the Benefits – Alexis.Rolland Jul 29 '19 at 15:27
• @alexis I don't need to switch it on if I just update the file (DAG) with the same name. – gorros Jul 30 '19 at 06:18
• @alexis Another question I have is: how do you test your DAG files? – gorros Jul 30 '19 at 06:27
• @gorros I usually use the Airflow CLI to trigger DAGs or tasks in a DAG, in particular [trigger_dag](https://airflow.apache.org/cli.html#trigger_dag) and [run](https://airflow.apache.org/cli.html#run) – Alexis.Rolland Aug 05 '19 at 05:55
• @Alexis.Rolland Reverting back to a previous version is not something that happens very often. Given a DAG's code is maintained in Git, why do you need to keep the previous versions of the same DAG in the Airflow /dags directory? If we have hundreds of DAGs to maintain, every version of each DAG will be shown in the UI (with 1 enabled and the rest disabled). This looks ugly in the UI as well as taking unnecessary space. – Anum Sheraz Oct 07 '21 at 19:42
    • @AnumSheraz at the time I wrote this answer it was necessary to update the `dag_id` to ensure Airflow detects the DAG has changed and reloads it. So the 100 DAGs example you mentioned was inevitable and any new version of a DAG would appear in the UI. Maybe this has changed as I haven't used Airflow for a while. Naming the actual DAG file with the `dag_id` is actually unrelated to that but just something we did for convenience to track the deployment of DAGs versions more easily. This is totally optional. Finally, one can also delete older DAG files / DAGs from the UI too (since v1.10) – Alexis.Rolland Oct 09 '21 at 08:24
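
A minimal sketch of the Variables approach described in this thread; the variable name is hypothetical and would need to be created under Admin > Variables in each environment (DEV, UAT, PROD):

    # Read an environment-specific value inside a DAG file.
    # 'api_base_url' is a made-up name; each Airflow environment holds
    # its own value for it under Admin > Variables.
    from airflow.models import Variable

    api_base_url = Variable.get("api_base_url")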

As of yet, Airflow doesn't have its own functionality for versioning workflows (see this). However, you can manage that on your own by keeping DAGs in their own Git repository and fetching its state into the Airflow repository as a submodule. This way, you always have a single Airflow version that contains sets of DAGs with specific versions. Watch more here
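
For illustration, a sketch of that submodule setup, driven from Python to stay consistent with the other examples here; the repository URL and path are made up:

    # Hypothetical setup: pin a separate DAGs repository as a submodule
    # of the main Airflow repository, so each Airflow commit references
    # exact DAG versions. The URL and path are placeholders.
    import subprocess

    subprocess.run(["git", "submodule", "add",
                    "https://example.com/team/airflow-dags.git", "dags"],
                   check=True)
    # Later, bring the submodule to the commit recorded in the parent repo:
    subprocess.run(["git", "submodule", "update", "--init"], check=True)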

Anum Sheraz

    One best practice is written in the documentation:

    Deleting a task

Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. It is advised to create a new DAG in case the tasks need to be deleted.

    I believe this is why the versioning topic is not so easy to solve yet, and we have to plan some workarounds.

    https://airflow.apache.org/docs/apache-airflow/2.0.0/best-practices.html#deleting-a-task
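
To illustrate that advice with a hedged example (all names are hypothetical): instead of deleting a task from my_dag-v1, publish the reduced DAG under a new DAG ID, so the removed task's run history stays visible under v1:

    # Hypothetical follow-up to the documentation quote above: rather
    # than removing 'obsolete_task' from my_dag-v1 in place, ship the
    # reduced DAG under a new DAG id and keep v1's history intact.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy import DummyOperator

    dag = DAG(
        "my_dag-v2",                      # new DAG id instead of an in-place delete
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    )
    keep_me = DummyOperator(task_id="keep_me", dag=dag)  # the obsolete task is simply omitted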