
Databricks recently added support for "files in repos", which is a neat feature. It gives our projects a lot more flexibility, since we can now add .json config files and even write custom Python modules that exist solely in our closed environment.
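For example, here is a minimal sketch of what that enables from a notebook in the repo (the module name utils.py and the config path are hypothetical; on recent runtimes the repo root is already on sys.path, so the explicit append is just a fallback):

import json
import os
import sys

# In a repo, a notebook's working directory is its location in the checkout,
# so sibling files are reachable with relative paths.
sys.path.append(os.getcwd())  # fallback: make sibling .py modules importable

import utils  # hypothetical utils.py sitting next to the notebook in the repo

# Arbitrary files such as JSON configs can be read straight from the checkout.
with open("config/settings.json") as f:  # hypothetical config file
    config = json.load(f)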

However, I just noticed that the standard way of deploying from an Azure git repo to a workspace does not support arbitrary files. First off, all .py files are converted to notebooks, breaking the custom modules we wrote for our project. Secondly, it intentionally skips any file that does not end in one of the following: .scala, .py, .sql, .SQL, .r, .R, .ipynb, .html, .dbc, which means our .json config files are missing when the deployment is finished.

Is there any way to get around these issues, or will we have to revert to using notebooks like we used to?

– Bjarne Thorsted

1 Answer


You need to stop doing deployment the old way, as it depends on the Workspace REST API, which doesn't support arbitrary files. Instead, you need to have a Git checkout (a Databricks Repo) in your destination workspace and update that checkout to a given branch/tag when doing a release. This can be done via the Repos API or the databricks CLI. Here is an example of how to do that with the CLI from an Azure DevOps pipeline:

- script: |
    echo "Checking out the releases branch"
    databricks repos update --path $(STAGING_DIRECTORY) --branch "$(Build.SourceBranchName)"
  env:
    DATABRICKS_HOST: $(DATABRICKS_HOST)
    DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
  displayName: 'Update Staging repository'
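If you prefer to call the Repos API directly instead of going through the CLI (the CLI command above wraps the same calls), here is a minimal sketch in Python; the repo path and branch name are hypothetical:

import os
import requests

host = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-....azuredatabricks.net
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Look up the repo ID from its workspace path (GET /api/2.0/repos).
resp = requests.get(
    f"{host}/api/2.0/repos",
    headers=headers,
    params={"path_prefix": "/Repos/Staging/my-project"},  # hypothetical path
)
resp.raise_for_status()
repo_id = resp.json()["repos"][0]["id"]

# Check the workspace copy out to the release branch (PATCH /api/2.0/repos/{id}).
resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "releases"},  # hypothetical branch
)
resp.raise_for_status()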
– Alex Ott
  • I am unable to test out this solution at the moment, but it looks promising and simple enough. I will return and mark your answer as a solution once our team has had the chance to try it out and verified that it works. – Bjarne Thorsted Apr 01 '22 at 10:20
  • I'm doing that almost every day :-) – Alex Ott Apr 01 '22 at 10:43
  • So, can you elaborate a bit on this, please? Does this require a repo on the target workspace as well? We are having some trouble figuring out how to set the environment variables correctly – Bjarne Thorsted Apr 22 '22 at 08:58
  • yes, it requires Databricks Repos in the target workspace. But you can create it using the CLI or the Repos API as well (see the sketch after these comments) – Alex Ott Apr 22 '22 at 09:42
  • When the target workspace is located on another Azure Subscription, what steps should we then take to make the example in your github work? We have an Azure AD PAT and we set up Git integration in the target environment to also use Azure DevOps (Personal Access Token). – Bjarne Thorsted Apr 22 '22 at 11:47
  • for that connection you only need databricks host & token, it doesn't matter much in which subscription it's located – Alex Ott Apr 22 '22 at 11:49
  • Okay, it's just that we keep getting the "Can’t find repo ID for /Repos/..." error, even though we are running a Standard SKU and as such don't have IP Access Lists enabled. If we simply try to run `databricks repos list --path-prefix $(STAGING_DIRECTORY)` we get `Error: Authorization failed. Your token may be expired or lack the valid scope`. – Bjarne Thorsted Apr 22 '22 at 11:52
  • are you using the correct token? – Alex Ott Apr 22 '22 at 15:11
  • I think so. We used a token generated from azure under the user menu. – Bjarne Thorsted Apr 22 '22 at 21:53
  • We basically followed this short guide: https://www.zachstagers.co.uk/p/connect-azure-databricks-to-a-devops-repo-in-a-different-tenancy/ – Bjarne Thorsted Apr 25 '22 at 08:42
  • I have instructions here: https://github.com/alexott/databricks-nutter-repos-demo – Alex Ott Apr 25 '22 at 08:58
  • We tried following that, but maybe we didn't understand how to do it properly. We created a PAT, but should we use the `azure-pipelines-devops.yml` as a template when using a DevOps repo? – Bjarne Thorsted Apr 25 '22 at 09:25
  • No, just `azure-devops.yaml` – Alex Ott Apr 25 '22 at 10:12
  • We couldn't get the databricks-cli to work, because apparently we needed both a databricks token and an Azure AD token via a service principal, but we managed to use the REST API via curl and it works now :) – Bjarne Thorsted Apr 27 '22 at 07:03
  • databricks-cli works just fine with AAD tokens – Alex Ott Apr 27 '22 at 07:13
  • Yes, but it does not support several concurrent tokens, which is what we needed. But now, we found out that `databricks repos update` works like a charm with just the databricks token, if we only use it as part of a DevOps Releases Pipeline rather than a general DevOps pipeline. – Bjarne Thorsted Apr 27 '22 at 07:36
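For completeness, here is a minimal sketch of creating the repo in the target workspace via the Repos API (POST /api/2.0/repos), as mentioned in the comments above; the Git URL and workspace path are placeholders:

import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Clone the Azure DevOps repo into the workspace under /Repos.
resp = requests.post(
    f"{host}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://dev.azure.com/my-org/my-project/_git/my-repo",  # placeholder
        "provider": "azureDevOpsServices",
        "path": "/Repos/Staging/my-project",  # placeholder
    },
)
resp.raise_for_status()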