
I have a Databricks notebook (Spark - Python) that reads from S3, does some ETL work, and writes the results back to S3. Now I want to run this code on a schedule as a .py script, not from a notebook. The reason I am looking to run a Python script is that it makes versioning easier.

I understand I need to create a job in Databricks that runs on a schedule. But it looks like a Databricks job can only run a JAR (Scala) or a notebook. I don't see a way to run a Python script there.

Am I missing something?

  • Where is this Python script developed? Locally? If yes, you can have a Jenkins pipeline that converts the .py script to an IPython notebook and writes it to DBFS so that it can be scheduled as a regular Python notebook job (a sketch of that conversion follows these comments). You can also do this in the script itself if you want: https://stackoverflow.com/questions/23292242/converting-to-not-from-ipython-notebook-format – Sai Nov 06 '20 at 05:49
  • @Sai no. I was basically writing the ETL in a Python notebook in Databricks for testing and analysis purposes. But eventually I want to take the code out of the notebook into a .py file for maintainability and versioning purposes. – nad Nov 06 '20 at 05:54
  • By default Databricks has version control: https://docs.databricks.com/notebooks/github-version-control.html – Sai Nov 06 '20 at 05:59
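For completeness, a minimal sketch of the .py-to-notebook conversion Sai describes above, using the nbformat library; the file names here are illustrative, not from the thread:

import nbformat
from nbformat import v4

# Read the plain Python script (hypothetical file name)
with open("etl_job.py") as f:
    source = f.read()

# Wrap the whole script in a single-cell v4 notebook
nb = v4.new_notebook()
nb.cells = [v4.new_code_cell(source)]

with open("etl_job.ipynb", "w") as f:
    nbformat.write(nb, f)

The resulting .ipynb could then be imported into DBFS or the workspace and scheduled as a regular notebook job.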

1 Answer


Unfortunately, this functionality is not currently available in the Databricks UI, but it is accessible via the REST API. You'll want to use the SparkPythonTask data structure.

You'll find this example in the official documentation - Jobs API examples.

curl -n -X POST -H 'Content-Type: application/json' -d \
'{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_D3_v2",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/pi.py",
    "parameters": [
      "10"
    ]
  }
}' https://<databricks-instance>/api/2.0/jobs/create
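For reference, here is a minimal sketch of what a script like dbfs:/pi.py could contain. This is the classic SparkPi example, not code from the question; the key detail is that the job's "parameters" list arrives as ordinary command-line arguments via sys.argv:

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkPi").getOrCreate()

# The "10" from the job's "parameters" list arrives as sys.argv[1]
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def inside(_):
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()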
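If you prefer to create the job from Python instead of curl, here is a sketch of the same call using the requests library. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variable names are my own convention, not part of the API; authentication uses a personal access token sent as a Bearer header:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<databricks-instance>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

job_spec = {
    "name": "SparkPi Python job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_D3_v2",
        "num_workers": 2,
    },
    "spark_python_task": {
        "python_file": "dbfs:/pi.py",
        "parameters": ["10"],
    },
    # Optional: since the goal is a scheduled run, the same request body
    # accepts a "schedule" block (Quartz cron syntax), e.g. daily at 02:00 UTC
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])

Note the "schedule" block above: it lets the job run on a schedule with no UI interaction at all, which is exactly the use case in the question.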

OR

You can execute JARs and Python scripts on Azure Databricks using Azure Data Factory.

Reference: Execute Jars and Python scripts on Azure Databricks using Data Factory
