
At the moment we schedule our Databricks notebooks using Airflow. Due to dependencies between projects, there are dependencies between DAGs. Some DAGs wait until a task in a previous DAG is finished before starting (by using sensors). We are now looking into using Databricks DBX. It is still new for us, but it seems that DBX's main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX. My question now is: is it possible to add dependencies between Databricks jobs? Can we create 2 different jobs using DBX, and make the second job wait until the first one is completed?
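For context, this is roughly what one of those cross-DAG dependencies looks like today (a minimal sketch; the DAG and task ids are made up, and the downstream operator is only a stand-in for the task that triggers the Databricks notebook):

```python
# Minimal sketch of the current setup: a downstream DAG waits, via
# ExternalTaskSensor, for a task in an upstream DAG before running its own
# Databricks work. All DAG and task ids here are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="project_b",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    # Wait for task "load_silver" in DAG "project_a" (same logical date).
    wait_for_project_a = ExternalTaskSensor(
        task_id="wait_for_project_a",
        external_dag_id="project_a",
        external_task_id="load_silver",
    )

    # Stand-in for the operator that triggers the Databricks notebook.
    run_databricks_notebook = EmptyOperator(task_id="run_databricks_notebook")

    wait_for_project_a >> run_databricks_notebook
```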

I am aware that I can have dependencies between tasks in one job, but in our case it is not possible to have only one job with all the tasks.

I was thinking about adding a notebook/Python script before the wheel with the ETL logic. This notebook would then check whether the previous job is finished. Once that is the case, the task with the wheel would be executed. Does this make sense, or are there better ways? Is something like the ExternalTaskSensor in Airflow available within Databricks workflows? Or is there a good way to use DBX without DB workflows?
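To make the idea concrete, a rough sketch of such a check, assuming it polls the Jobs API 2.1 runs/list endpoint (the host, token and upstream job id are placeholders):

```python
# Rough sketch of the "check first" idea: poll the Databricks Jobs API
# (/api/2.1/jobs/runs/list) until the latest run of the upstream job has
# finished, then let the wheel task run. Host, token and job id are placeholders.
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
UPSTREAM_JOB_ID = 123                  # hypothetical upstream job id


def latest_run_state(job_id: int) -> dict:
    """Return the state of the most recent run of the given job."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"job_id": job_id, "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return runs[0]["state"] if runs else {}


def wait_for_upstream(job_id: int, poll_seconds: int = 60) -> None:
    """Block until the upstream job's latest run has terminated successfully."""
    while True:
        state = latest_run_state(job_id)
        if state.get("life_cycle_state") == "TERMINATED":
            if state.get("result_state") == "SUCCESS":
                return
            raise RuntimeError(f"Upstream job {job_id} did not succeed: {state}")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_upstream(UPSTREAM_JOB_ID)
    # ...then hand over to the task that runs the wheel with the ETL logic
```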

gamezone25

1 Answer


author of dbx here.

TL;DR - dbx is not opinionated in terms of the orchestrator choice.

> It is still new for us, but it seems that DBX's main added value is when you use Databricks workflows. It would be possible to run a Python wheel in a job that was created by DBX.

The short answer is yes, but it's done at the task level (read more here on the difference between a workflow and a task).
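To illustrate what task-level dependencies look like, here is a sketch expressed directly against the Jobs API 2.1 (the package name, entry points, wheel path and cluster spec are made up; dbx builds an equivalent job definition from its deployment file):

```python
# Illustration of task-level dependencies in a single Databricks workflow:
# a Jobs API 2.1 "create" payload with two wheel tasks, where the second
# task only starts after the first one succeeds. All names, the wheel path
# and the cluster spec are hypothetical.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]
WHEEL = "dbfs:/artifacts/my_pkg-0.1.0-py3-none-any.whl"  # placeholder wheel path

job_spec = {
    "name": "etl-and-report",
    "job_clusters": [
        {
            "job_cluster_key": "main",
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "etl",
            "job_cluster_key": "main",
            "libraries": [{"whl": WHEEL}],
            "python_wheel_task": {"package_name": "my_pkg", "entry_point": "etl"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "etl"}],  # waits for the "etl" task
            "job_cluster_key": "main",
            "libraries": [{"whl": WHEEL}],
            "python_wheel_task": {"package_name": "my_pkg", "entry_point": "report"},
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```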

Another approach, if you still need (or want) to use Airflow, is the following:

  1. Deploy and update your jobs from your CI/CD pipeline with dbx deploy commands.
  2. In Airflow, use the Databricks Operator to launch the job (either by name or by id), as sketched below.
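A minimal sketch of step 2 (the connection id, job id/name and schedule are placeholders; DatabricksRunNowOperator comes from the apache-airflow-providers-databricks package):

```python
# Sketch of step 2: an Airflow DAG that triggers a job that dbx deploy
# created or updated. Connection id, job id and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="trigger_dbx_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    run_etl_job = DatabricksRunNowOperator(
        task_id="run_etl_job",
        databricks_conn_id="databricks_default",
        job_id=123,  # or job_name="etl-and-report" on newer provider versions
    )
```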
renardeinside
  • Thanks for the answer. We had already thought about your proposed approach as well. However, we would prefer not to use Databricks jobs at all, as we don't want to split our orchestration between Databricks and Airflow. Is a scenario possible where we use `dbx execute` for active development and deploy a Python wheel (on a data lake or somewhere else) and execute this wheel directly through Airflow? Or is this not recommended? – gamezone25 Jan 30 '23 at 09:35
  • This approach is definitely not recommended. The reasoning is the following: dbx execute runs on all-purpose clusters (also sometimes called interactive clusters). Interactive clusters are not suitable for prod-like workloads. Therefore the logic is the following: use dbx execute from a local machine for interactive runs, use dbx deploy to deploy the jobs, and then orchestrate these jobs using Airflow. Differences between all-purpose and job clusters are described here - https://dbx.readthedocs.io/en/latest/concepts/cluster_types/ – renardeinside Jan 30 '23 at 09:39
  • We just use the all-purpose clusters for development. Once development is done, we would deploy a Python wheel and run it directly with Airflow on a DB job cluster (just like a notebook that is triggered using Airflow). – gamezone25 Jan 30 '23 at 09:42
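A rough sketch of the scenario described in the last comment - a one-off run of the deployed wheel on a job cluster, triggered directly from Airflow - assuming a provider version that targets the Jobs API 2.1 multi-task run format (the wheel path, package, entry point and cluster spec are placeholders):

```python
# Sketch: trigger a one-off run of the deployed wheel on a job cluster
# directly from Airflow, without a permanent Databricks job. Wheel path,
# package name, entry point and cluster spec are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_payload = {
    "run_name": "run-etl-wheel",
    "tasks": [
        {
            "task_key": "etl",
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "libraries": [{"whl": "dbfs:/artifacts/my_pkg-0.1.0-py3-none-any.whl"}],
            "python_wheel_task": {"package_name": "my_pkg", "entry_point": "etl"},
        }
    ],
}

with DAG(
    dag_id="run_wheel_on_job_cluster",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    run_wheel = DatabricksSubmitRunOperator(
        task_id="run_wheel",
        databricks_conn_id="databricks_default",
        json=run_payload,
    )
```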