
Our project runs on each client's infrastructure, where the infrastructure is managed with Kubernetes and Terraform. We automate our jobs using Airflow.

Every dbt job in Airflow runs via the KubernetesPodOperator. We plan to create data lineage graphs for each client's tables.

I saw this question:

How to setup dbt UI for data lineage?

and, using the two commands below, I can generate and serve the dbt docs on my local machine:

dbt docs generate
dbt docs serve --port 8081

Now I need to generate the same docs in each client's environment, so I have written the DAG tasks shown below:

sync_data_lineage = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/usr/local/bin/dbt"],
    arguments=['docs', 'generate'],
    env_vars=env_var,
    name="sync_data_lineage",
    configmaps=['awskey'],
    task_id="sync_data_lineage",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

deploy_lineage_graph = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/usr/local/bin/dbt"],
    arguments=['docs', 'serve', '--port', '8081'],
    env_vars=env_var,
    name="deploy_lineage_graph",
    configmaps=['awskey'],
    task_id="deploy_lineage_graph",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

sync_data_lineage >> deploy_lineage_graph

The first task runs successfully, but the second one fails because it cannot find catalog.json, which is created by the first task, sync_data_lineage. The reason is that once the first dbt command has generated catalog.json, its pod is destroyed. The second task runs in a fresh pod, so it cannot serve the docs because the catalog.json from the first step is missing.

How can I resolve this?

– azaveri7
  • I am not using any of this (neither airflow nor k8s), but a wild guess: would running `dbt docs generate && dbt docs serve --port 8081` (so both commands in the same pod) fix this, if the issue is related to missing the `catalog.json`? – Aleix CC Mar 13 '23 at 16:24
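
In DAG terms, that suggestion would mean collapsing the two tasks into a single pod that runs both commands through a shell, so the serve step still sees the catalog.json written by the generate step. A minimal sketch, assuming the image has a shell at /bin/sh and the dbt project as its working directory:

# Both commands run in the same pod, so the catalog.json written by
# 'docs generate' is still present when 'docs serve' starts.
# Assumption: the image provides /bin/sh and the dbt project is the
# container's working directory.
docs_generate_and_serve = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/bin/sh", "-c"],
    arguments=["/usr/local/bin/dbt docs generate && /usr/local/bin/dbt docs serve --port 8081"],
    env_vars=env_var,
    name="docs_generate_and_serve",
    configmaps=['awskey'],
    task_id="docs_generate_and_serve",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

Note that dbt docs serve blocks until it is killed, so this task would never complete from Airflow's point of view; for a permanently available lineage UI, the docs are better served by a long-running Deployment and Service than by a DAG task.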

1 Answer


Try saving the dbt artifacts (the target/ directory, which includes catalog.json) to S3 or other external storage after the generate step, and loading them back before the serve step.

– gunn
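
Concretely, that could mean syncing dbt's target/ directory (which holds catalog.json, manifest.json, and index.html after docs generation) to S3 in the first task and pulling it back down in the second. A minimal sketch, assuming the image ships the AWS CLI, the 'awskey' configmap provides credentials, and the bucket name below is hypothetical:

# Persist dbt's target/ directory between the two pods via S3.
# Assumptions: the image contains the AWS CLI, the 'awskey' configmap
# supplies credentials, and the bucket name is made up for illustration.
ARTIFACT_BUCKET = "s3://my-dbt-artifacts/lineage"  # hypothetical bucket

sync_data_lineage = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/bin/sh", "-c"],
    # Generate the docs, then upload the artifacts for the next pod.
    arguments=[f"/usr/local/bin/dbt docs generate && aws s3 sync target/ {ARTIFACT_BUCKET}/target/"],
    env_vars=env_var,
    name="sync_data_lineage",
    configmaps=['awskey'],
    task_id="sync_data_lineage",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

deploy_lineage_graph = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/bin/sh", "-c"],
    # Download the artifacts generated by the previous pod, then serve them.
    arguments=[f"aws s3 sync {ARTIFACT_BUCKET}/target/ target/ && /usr/local/bin/dbt docs serve --port 8081"],
    env_vars=env_var,
    name="deploy_lineage_graph",
    configmaps=['awskey'],
    task_id="deploy_lineage_graph",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

sync_data_lineage >> deploy_lineage_graph

A shared PersistentVolumeClaim mounted into both pods would achieve the same handoff without leaving the cluster.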