Our project runs on each client's infrastructure, where the infrastructure is managed via Kubernetes and Terraform. We automate our jobs using Airflow.
Any Airflow task that runs dbt uses the KubernetesPodOperator provided by Airflow. We plan to create data lineage graphs for each client's tables.
I came across this question: "How to setup dbt UI for data lineage?". Using the two commands below, I can generate the dbt docs on my local machine:
dbt docs generate
dbt docs serve --port 8081
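(As far as I understand, dbt docs generate writes catalog.json, manifest.json, and index.html into the project's target/ directory, and dbt docs serve just serves that directory over HTTP, so locally something like the following also works, assuming the default target/ path:)
python -m http.server 8081 --directory target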
Now I need to generate the same docs at each client's location, so I have written the DAG shown below:
sync_data_lineage = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/usr/local/bin/dbt"],
    arguments=['docs', 'generate'],
    env_vars=env_var,
    name="sync_data_lineage",
    configmaps=['awskey'],
    task_id="sync_data_lineage",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)

deploy_lineage_graph = KubernetesPodOperator(
    namespace='etl',
    image=f'blotout/dbt-analytics:{TAG_DBT_VERSION}',
    cmds=["/usr/local/bin/dbt"],
    arguments=['docs', 'serve', '--port', '8081'],
    env_vars=env_var,
    name="deploy_lineage_graph",
    configmaps=['awskey'],
    task_id="deploy_lineage_graph",
    get_logs=True,
    dag=dag,
    is_delete_operator_pod=True,
)
sync_data_lineage >> deploy_lineage_graph
The first task runs successfully, but the second one fails because it cannot find the catalog.json created by the first task, 'sync_data_lineage'. The reason is that once the first dbt command has generated catalog.json, its pod is destroyed. The second task runs in a new pod, so it cannot serve the docs because the catalog.json from the first step is missing.
How can I resolve this?
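One direction I have been considering is mounting a shared PersistentVolumeClaim into both pods so that the dbt target/ directory survives between tasks. Below is a rough, unverified sketch of what I mean; the claim name dbt-target-pvc and the mount path /usr/app/target are assumptions on my side, and the exact volume classes accepted by KubernetesPodOperator depend on the Airflow version:

from kubernetes.client import models as k8s

# Hypothetical PVC shared by both tasks; 'dbt-target-pvc' would have to exist in the 'etl' namespace.
dbt_target_volume = k8s.V1Volume(
    name='dbt-target',
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name='dbt-target-pvc'),
)
dbt_target_mount = k8s.V1VolumeMount(
    name='dbt-target',
    mount_path='/usr/app/target',  # assumed location of the dbt target/ dir inside the image
    read_only=False,
)

# Both operators would then additionally be given:
#   volumes=[dbt_target_volume],
#   volume_mounts=[dbt_target_mount],

But I am not sure whether a shared volume like this is the right approach, or whether I should instead copy the generated target/ directory to S3 (we already have the awskey configmap) and serve the docs from there.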