
I have a Databricks notebook (Spark - Python) that reads from S3, does some ETL work, and writes the results back to S3. Now I want to run this code on a schedule as a .py script, not from a notebook. The reason I am looking to run a Python script is that it makes versioning easier.

I understand I need to create a job in Databricks that runs on a schedule. But it looks like a Databricks job can only run a JAR (Scala) or a notebook. I don't see a way to run a Python script there.

Am I missing something?

  • Where is this Python script developed? Locally? If yes, you can have a Jenkins pipeline that converts the .py script to an IPython notebook and writes it to DBFS so that it can be scheduled as a regular Python notebook job (a sketch of that conversion follows these comments). You can also do this in the script itself if you want: https://stackoverflow.com/questions/23292242/converting-to-not-from-ipython-notebook-format – Sai Nov 06 '20 at 05:49
  • @Sai no. I was basically writing the ETL in a Python notebook in Databricks for testing and analysis purposes. But eventually I want to take the code out of the notebook into a .py file for maintainability and versioning purposes. – nad Nov 06 '20 at 05:54
  • By default Databricks has version control: https://docs.databricks.com/notebooks/github-version-control.html – Sai Nov 06 '20 at 05:59
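For completeness, a minimal sketch of the .py-to-notebook conversion Sai describes above, using the nbformat library; the file names here are illustrative, not from the thread:

import nbformat
from nbformat import v4

# Read the plain Python script (hypothetical file name)
with open("etl_job.py") as f:
    source = f.read()

# Wrap the whole script in a single-cell v4 notebook
nb = v4.new_notebook()
nb.cells = [v4.new_code_cell(source)]

with open("etl_job.ipynb", "w") as f:
    nbformat.write(nb, f)

The resulting .ipynb could then be imported into DBFS or the workspace and scheduled as a regular notebook job.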

1 Answer


Unfortunately, this functionality is not currently available in the Databricks UI, but it is accessible via the REST API. You'll want to use the SparkPythonTask data structure.

You'll find this example in the official documentation - Jobs API examples.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_D3_v2",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/pi.py",
    "parameters": [
      "10"
    ]
  }
}' https://<databricks-instance>/api/2.0/jobs/create
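For reference, here is a minimal sketch of what a script like dbfs:/pi.py could contain. This is the classic SparkPi example, not code from the question; the key detail is that the job's "parameters" list arrives as ordinary command-line arguments via sys.argv:

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkPi").getOrCreate()

# The "10" from the job's "parameters" list arrives as sys.argv[1]
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def inside(_):
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()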
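If you prefer to create the job from Python instead of curl, here is a sketch of the same call using the requests library. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variable names are my own convention, not part of the API; authentication uses a personal access token sent as a Bearer header:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<databricks-instance>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

job_spec = {
    "name": "SparkPi Python job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_D3_v2",
        "num_workers": 2,
    },
    "spark_python_task": {
        "python_file": "dbfs:/pi.py",
        "parameters": ["10"],
    },
    # Optional: since the goal is a scheduled run, the same request body
    # accepts a "schedule" block (Quartz cron syntax), e.g. daily at 02:00 UTC
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])

Note the "schedule" block above: it lets the job run on a schedule with no UI interaction at all, which is exactly the use case in the question.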

OR

You can execute JARs and Python scripts on Azure Databricks using Azure Data Factory.

Reference: Execute Jars and Python scripts on Azure Databricks using Data Factory
