
I have an integration test that compares the output from running the same scripts from 2 different branches (i.e., master and a feature branch). Currently this test kicks off from my local machine, but I'd like to migrate it to a Databricks job and run it entirely from the Workflows interface.

I'm able to recreate most of the existing integration test (written in Python) using notebooks and dbutils, with the exception of the feature branch checkout. I can make a call from my local machine to the Repos REST API to perform the checkout, but (from what I can tell) I can't make that same call from a job that's running on the Databricks cloud. (I run into credentials/authentication issues when I try, and my solutions are getting increasingly hacky.)

Is there a way to check out a branch using pure Python code; something like a dbutils.repos.checkout()? Alternatively, is there a safe way to call the REST APIs from a job that's running on the Databricks cloud?

stevemn
  • the question needs sufficient code for a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – D.L Sep 15 '22 at 14:24
  • This isn't a code question. This is a feature question for a commercial product, one with many other submissions on SO. I am asking whether or not something is possible, making "sufficient code" utterly irrelevant – stevemn Sep 15 '22 at 14:33

2 Answers


You can use the Repos REST API, specifically its Update command. But if you're doing CI/CD, it's easier to use the `databricks repos update` command of the Databricks CLI, like this:

databricks repos update --path <path> --branch <branch>

P.S. I have an end-to-end example of doing CI/CD for Repos + Notebooks on Azure DevOps, but the approach will be the same for other systems. Here is an example of using the Databricks CLI for checkout.
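
If you need to make that REST call from inside the job itself, one option (a sketch, not an official API) is to reuse the workspace URL and API token exposed by the notebook context, the same trick used in the second answer below. The repo path and branch name here are hypothetical:

import json
import requests

# Workspace URL and API token taken from the notebook context (unofficial, but works in notebooks)
ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
host = ctx['extraContext']['api_url']
token = ctx['extraContext']['api_token']
headers = {"Authorization": f"Bearer {token}"}

# Look up the repo ID from its workspace path (hypothetical path, assumed to exist)
repos = requests.get(f"{host}/api/2.0/repos",
                     headers=headers,
                     params={"path_prefix": "/Repos/integration/project"}).json()["repos"]
repo_id = repos[0]["id"]

# Update the repo, i.e. check out the feature branch (hypothetical branch name)
requests.patch(f"{host}/api/2.0/repos/{repo_id}",
               headers=headers,
               json={"branch": "my-feature-branch"}).raise_for_status()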

Alex Ott
  • Thanks Alex, I was hoping you'd spot this and pick it up. Your answers elsewhere have been very helpful. My ultimate solution might be an abuse of DBFS. I'm accessing Repos files and doing a copy with the following: `dbutils.fs.cp("file:/Workspace/Repos/...")`. Is this file path safe to use? i.e. is it okay to touch Repos files via `file:/Workspace/Repos`? – stevemn Oct 23 '22 at 02:17
  • Yes, you can use these files - this was specifically designed to provide config files, data files, etc. The only caveat *right now* is that it is read-only. – Alex Ott Oct 23 '22 at 07:18
  • @AlexOtt I notice that the update can pull the specified branch, but it does not discard changes that someone may have made in the path. Is there any way to pull the branch and also discard any changes that someone may have made? – George Sotiropoulos Feb 06 '23 at 11:39
  • @AlexOtt I found the answer to my question (a workaround): what one can actually do is delete the repo, create it again, and pull the branch (if it's a branch other than master). – George Sotiropoulos Feb 06 '23 at 11:52
  • Yes, but just take into account that this counts against Databricks API throttling limits, and also against GitHub and other Git providers' limits, which are quite low (and you'll be blocked during that time). Discarding changes may come to the API, but check with your Databricks representative – Alex Ott Feb 06 '23 at 13:34
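
For reference, a minimal sketch of the file:/Workspace/Repos copy mentioned in the comments above; the source and destination paths are hypothetical:

# Repo files are exposed read-only on the driver under /Workspace/Repos,
# so they can be copied out with dbutils.fs.cp (paths below are hypothetical)
dbutils.fs.cp("file:/Workspace/Repos/integration/project/config/test_config.json",
              "dbfs:/tmp/integration_test/test_config.json")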

Just for the record, here is some code that you can execute in a notebook to "update" another repo folder and then execute it. I believe it does what the accepted answer says, using the databricks-cli API from within a Databricks notebook.

import json

from databricks_cli.repos.api import ReposApi
from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.workspace.api import WorkspaceApi

# Get the workspace URL and API token from the notebook context
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
url = context['extraContext']['api_url']
token = context['extraContext']['api_token']

api_client = ApiClient(
    host=url,
    token=token
)

repo_url = "https://yourhost@dev.azure.com/your_repo_url"  # same as the one you use to clone
repos_path = "/Repos/your_repo/"
repos_api = ReposApi(api_client)
workspace_api = WorkspaceApi(api_client)

# 1. Create the parent folder if it doesn't exist
workspace_api.mkdirs(repos_path)

# 2. If the repo already exists, delete it, to ensure you get the branch you want
try:
    repo_id = repos_api.get_repo_id(repos_path + "your_repo")
    repos_api.delete(repo_id)
except RuntimeError:
    pass

# 3. Clone the repo again and check out the desired branch
repos_api.create(url=repo_url, path=repos_path + "your_repo", provider='azureDevOpsServices')
repos_api.update(repo_id=repos_api.get_repo_id(repos_path + "your_repo"),
                 branch='master', tag=None)

What it does:

  1. First connects using the notebook context.
  2. Then deletes the target folder if it exists.
  3. Creates the repo again and updates it (the update is probably redundant).

I am deleting the existing folder to avoid conflicts with local changes. If someone made changes in the target Repos folder and you just update, you pull the changes from the origin, but that doesn't remove the changes already there. With delete and create, it's like resetting the folder.

In that way you can execute a script from another repo.
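
For example, once the repo has been recreated, a notebook from it can be run directly (a sketch; the notebook name is hypothetical):

# Run a notebook from the freshly checked-out repo;
# 'run_integration_test' is a hypothetical notebook name, 600 is the timeout in seconds
result = dbutils.notebook.run(repos_path + "your_repo/run_integration_test", 600)
print(result)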

Alternatively, another way to do this is to create a job in Databricks and use the Databricks API to run it. However, you will have to create a different job for each notebook to be executed.
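
A sketch of that alternative, reusing the url and token obtained from the notebook context above; the job would have to be created beforehand, and the job_id here is a placeholder:

import requests

# Trigger an existing job by its ID (123 is a placeholder) and print the run ID
resp = requests.post(f"{url}/api/2.1/jobs/run-now",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"job_id": 123})
resp.raise_for_status()
print(resp.json()["run_id"])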

George Sotiropoulos