
I have a DAG defined like this:

tmpl_search_path = '/home/airflow/gcs/sql_requests/'

with DAG(dag_id='pipeline', default_args=default_args, template_searchpath = [tmpl_search_path]) as dag:

    create_table = bigquery_operator.BigQueryOperator(
        task_id = 'create_table',
        sql = 'create_table.sql',
        use_legacy_sql = False,
        destination_dataset_table = some_table
    )

The task create_table calls a SQL script, create_table.sql. This SQL script is not in the DAG folder: it is in a sql_requests folder at the same level as the DAG folder. The architecture inside the GCP Composer bucket (Composer is Google's managed Airflow) is:

bucket_name
|- airflow.cfg
|- dags
   |_ pipeline.py
|- ...
|_ sql_requests
   |_ create_table.sql

What path do I need to set for template_searchpath to reference the sql_requests folder inside the Airflow bucket on GCP?

I have tried template_searchpath=['/home/airflow/gcs/sql_requests'], template_searchpath=['../sql_requests'] and template_searchpath=['/sql_requests'], but none of these has worked.

The error message I get is 'jinja2.exceptions.TemplateNotFound'.

Valentin Richer

4 Answers


According to https://cloud.google.com/composer/docs/concepts/cloud-storage, it is not possible to store files that are needed to execute DAGs anywhere other than the dags or plugins folders:

To avoid a workflow failure, store your DAGs, plugins, and Python modules in the dags/ or plugins/ folders—even if your Python modules do not contain DAGs or plugins.

This is the reason why I had the TemplateNotFound error.
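For reference, a minimal sketch of a setup that follows this rule, assuming the SQL file is moved to gs://<composer-bucket>/dags/sql_requests/create_table.sql and using placeholder values for default_args and the destination table (the dags/ folder of the bucket is mounted at /home/airflow/gcs/dags on Composer workers):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators import bigquery_operator

default_args = {'start_date': datetime(2019, 6, 1)}  # placeholder
some_table = 'my_project.my_dataset.my_table'        # placeholder destination table

# dags/ in the Composer bucket is mounted here on the workers
tmpl_search_path = '/home/airflow/gcs/dags/sql_requests/'

with DAG(dag_id='pipeline',
         default_args=default_args,
         template_searchpath=[tmpl_search_path]) as dag:

    create_table = bigquery_operator.BigQueryOperator(
        task_id='create_table',
        sql='create_table.sql',  # resolved via template_searchpath
        use_legacy_sql=False,
        destination_dataset_table=some_table,
    )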

Valentin Richer

You can store files in the mounted/known paths, which are dags/, plugins/ or data/.

The data/ folder has no capacity limit, but it is easy to trip yourself up by using it to store anything the web server needs to read, because the web server cannot access that folder. For example, if you put SQL files in the data/ folder, you will not be able to see the rendered template in the UI, but any task that needs to access the file at run time will run fine.
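As a sketch of that option, assuming the SQL file is uploaded to gs://<composer-bucket>/data/sql_requests/create_table.sql, only the search path from the question's DAG needs to change (the data/ folder is mounted at /home/airflow/gcs/data on the workers):

# data/ in the Composer bucket is mounted at /home/airflow/gcs/data;
# templates here render fine at task run time, but the web server UI cannot display them
tmpl_search_path = '/home/airflow/gcs/data/sql_requests/'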

the pillow

Move the 'sql_requests' folder into the 'dags' folder so that your code looks like this:

tmpl_search_path = '/home/airflow/dags/sql_requests/'  
with DAG(dag_id='pipeline', default_args=default_args, template_searchpath = [tmpl_search_path]) as dag:      
    create_table = bigquery_operator.BigQueryOperator(
        task_id = 'create_table',
        sql = 'create_table.sql',
        use_legacy_sql = False,
        destination_dataset_table = some_table
    )

For me, it works!

Marcelo Vinicius

I believe that, by default, the operator looks for SQL files in the DAG folder, so you could put your SQL file at

gs://composer-bucket-name/dags/sql/create_table.sql

And then reference it as

sql = '/sql/create_table.sql'

If that doesn't work, try it without the leading / (which I'm not sure you need)
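Putting that together, a sketch assuming the file is uploaded to gs://composer-bucket-name/dags/sql/create_table.sql and the rest of the operator call stays as in the question:

create_table = bigquery_operator.BigQueryOperator(
    task_id='create_table',
    sql='sql/create_table.sql',  # relative to the dags/ folder; try with or without a leading slash
    use_legacy_sql=False,
    destination_dataset_table=some_table,
)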

Edit

If you want to put them in a folder at the root of the bucket, try

sql = '../sql/create_table.sql'
Adam Hopkinson
  • Thank you for your answer. I have done this and it worked. But I would like to know if it is possible to put my SQL requests in a folder other than the dag folder, like `gs://composer-bucket-name/sql/create_table.sql`. I tried to follow the same pattern but it did not work and I have no idea why. – Valentin Richer Jun 07 '19 at 13:39
  • Thanks for your update, but unfortunately I tried it and it did not work, I don't know why. – Valentin Richer Jun 11 '19 at 08:30