
I have two ML projects on Azure Databricks that work almost identically except that they are for different clients. Essentially, I want to use some management system so I can share and reuse the same code across different projects (e.g. Python files that store helpful functions for feature engineering, Databricks notebooks that perform similar initial data preprocessing, some configuration files, etc.). At the same time, if an update is made to the shared code, it needs to be synced to all the projects that use it.

I know that with Git we can use submodules to do this: the common code is stored in Repo C, which is added as a submodule to Repo A and Repo B. But the problem is that Azure Databricks doesn't support submodules. It also only supports working branches up to 200 MB, so I cannot use a monorepo (i.e. have all the code in one repository) either. I was thinking of creating a package for the shared Python files, but I also have a few core versions of notebooks that I want to share, which I don't think can be built into a package.

Is there any other way to do this on Databricks so I can reuse the code rather than just copying and pasting it?

Jing Lin
  • Are you expecting to migrate Databricks notebooks from one workspace to another? Can you please confirm my understanding. – Karthikeyan Rasipalay Durairaj Feb 15 '22 at 16:27
  • @KarthikeyanRasipalayDurairaj No, I am actually using Databricks Repos for Git integration with Azure DevOps, but Databricks Repos currently doesn't support submodules, so I cannot use that method to share code across projects. – Jing Lin Feb 15 '22 at 17:44
  • The git-subtree stuff could potentially be pressed into service here. I'm not a big fan of it because it's largely unmaintained and weird bugs come up now and then, but it might serve your needs. – torek Feb 15 '22 at 23:20

1 Answer


At some point, the recommended solution from Databricks was to:

  1. clone the common code repo to a separate path, /Workspace/Repos/<user-name>/<repo-name>
  2. add the above path to sys.path in the notebook that needs access to the common code repo:

import sys

# Make modules from the cloned common code repo importable in this notebook
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

This will enable you to import Python modules from the common code repo. Depending on the exact location of your module in the repo, you might need to change the path that you append to sys.path.
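For example, if the shared repo had a feature_utils.py module at its root (a hypothetical layout; adjust the names to your repo), a notebook in either project could use it like this:

import sys

# Make the cloned common code repo importable
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")

# feature_utils is a hypothetical module at the root of the shared repo
import feature_utils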

fskj
  • That is exactly what I used to do. But later I found out that it doesn't work with Job clusters. And there the only way seems to be building the shared code into a library and installing it on the cluster. Or is there a better way? – Oleg Oct 19 '22 at 08:50
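
As the comment above notes, the sys.path approach doesn't work with job clusters, and building the shared code into a library installed on the cluster is the usual workaround. A minimal setup.py sketch for that, assuming the shared modules live in a shared/ package inside the common code repo (the package name here is illustrative):

# setup.py at the root of the common code repo (illustrative names)
from setuptools import setup, find_packages

setup(
    name="shared-ml-utils",    # hypothetical package name
    version="0.1.0",
    packages=find_packages(),  # picks up e.g. the shared/ package
)

Building a wheel (for example with pip wheel .) and installing it on the cluster as a library makes the shared code available to job clusters as well.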