
I'm developing Python code to be used as entry points for various wheel-based workflows on Databricks. Since it's under active development, after each code change I need to build a wheel and deploy it on a Databricks cluster to test it (I use some functionality that's only available in the Databricks runtime, so I can't run it locally).

Here is what I do:

REMOTE_ROOT='dbfs:/user/kash@company.com/wheels'
cd /home/kash/workspaces/project
rm -rf dist

poetry build
whl_file=$(ls -1tr dist/project-*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying..'     && databricks fs cp --overwrite dist/$whl_file $REMOTE_ROOT
echo 'installing..'  && databricks libraries install --cluster-id 111-222-abcd \
                                                    --whl $REMOTE_ROOT/$whl_file
# ---- I WANT TO AVOID THIS as it takes time ----
echo 'restarting'    && databricks clusters restart --cluster-id 111-222-abcd

# Run the job that uses some modules from the wheel we deployed
echo 'running job..' && dbk jobs run-now --job-id 1234567

The problem is that every time I make even a one-line change I need to restart the cluster, which takes 3-4 minutes. And unless I restart the cluster, `databricks libraries install` does not reinstall the wheel.
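For completeness, the documented update cycle for a cluster-installed library (which is exactly what I'm trying to avoid) looks roughly like this; the uninstall subcommand below is from the legacy CLI and, per the docs, only takes effect after a restart:

```shell
# Sketch of the documented uninstall -> restart -> install cycle.
# Uninstalling a cluster library only takes effect after a restart,
# so the slow restart step cannot be skipped this way either.
databricks libraries uninstall --cluster-id 111-222-abcd \
                               --whl $REMOTE_ROOT/$whl_file
databricks clusters restart    --cluster-id 111-222-abcd   # the 3-4 minute step
databricks libraries install   --cluster-id 111-222-abcd \
                               --whl $REMOTE_ROOT/$whl_file
```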

I've tried bumping the wheel's version number, but then the GUI (Compute -> select cluster -> Libraries tab) shows two versions of the same wheel installed on the cluster, while on the cluster itself the newer version is not actually installed (verified using `ls -l .../site-packages/`).

Kashyap
  • Is the job that you're running a notebook or another wheel? – Alex Ott Aug 25 '22 at 15:53
  • @AlexOtt It's a wheel-based-workflow (pka "job"). See link in OP. – Kashyap Aug 25 '22 at 16:08
  • 1
    Unfortunately such library reinstallation behaviour is not supported on all-purpose clusters as documented [here](https://docs.databricks.com/libraries/cluster-libraries.html#update-a-cluster-installed-library). There are various options that could fit this requirement: * use `dbx execute` which install libraries in a notebook-scoped context which supports library reinstallation * use instance pools and run your tests on job clusters. – renardeinside Aug 26 '22 at 17:02

1 Answer


What would perfectly suit your requirements is dbx by Databricks Labs.

Sure, you could look at their source code on GitHub and try to mimic the same in your own code, but that would be way too much work when dbx (specifically its `execute` command) already does this for you.

With it, you can keep making changes to your Python code and run `dbx execute --task=<task name defined in your deployment config> --cluster-name=<your all-purpose cluster name>`, all while still developing in your local IDE.
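A minimal sketch of that inner loop (the task and cluster names are placeholders, and the exact flags vary between dbx versions):

```shell
# Placeholders: substitute your own task and cluster names.
# dbx builds the wheel, uploads it, and runs the task in a fresh
# execution context on the existing all-purpose cluster -- no restart.
dbx execute --task=my-etl-task --cluster-name="dev-all-purpose"
```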

That takes care of building the wheel, deploying it to the cluster, and starting the job for you to test, all from your local IDE.

So you can keep changing your wheel during development and keep testing on the same running cluster (dbx starts it if it's not running), without restarting, because dbx runs the task in a separate execution context. See the screenshot below from their documentation.

The main page of dbx is here.

This specific section of the documentation explains this functionality.

I have just started using dbx and it does make these things very simple.

EDIT: based on the OP's comment asking to provide context for the links:


Here, I have a wheel under development that I call using a whl task in dbx's deployment.yml file.

(Screenshot: whl task in deployment.yml)
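For context, a rough sketch of what such a deployment.yml could look like (all names are placeholders and the exact schema depends on your dbx version):

```yaml
# Hypothetical sketch -- adjust the names to your project.
build:
  python: "poetry"              # dbx can delegate the wheel build to poetry
environments:
  default:
    workflows:
      - name: "my-workflow"
        tasks:
          - task_key: "my-etl-task"
            python_wheel_task:
              package_name: "project"
              entry_point: "main"
```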

I then test it using `dbx execute` on my ad-hoc interactive cluster. As you can see in the first screenshot, my cluster is terminated, so `dbx execute` automatically starts it, uploads the wheel, and starts the job.

(Screenshot: first run of the whl task)

I then make more changes to my Python package and test the wheel again using `dbx execute`. As you can see below, the same cluster is used (this time it was already running, so dbx used it without restarting), and the same version of the wheel is uploaded (the OP's original question asked about being able to work with the same wheel in development without bumping the version or restarting the cluster).

(Screenshot: second run of the same job)

The OP's original question was about working with the same version of the wheel without restarting an already-running cluster every time, which, as the two screenshots show, dbx addresses.

Regarding toolsets: although I have only tried dbx's defaults, at least according to dbx's documentation, poetry is supported (poetry was mentioned in the comments) -> poetry support.


Saugat Mukherjee
  • I've tried to use it. It doesn't work. It has a bunch of basic restrictions (e.g. must use ML runtime), which render it useless, not to mention it expects that you use whatever toolset it recommends and throw out what you might have (poetry, pyproject.toml, ...). Perhaps in a few years it will do what this post needs. If you really think it works, then instead of posting links, "provide context for links" (stackoverflow.com/help/how-to-answer). – Kashyap Aug 30 '22 at 15:53
  • Thanks for the effort. Just FYI, the author of `dbx` also tried to help (https://stackoverflow.com/questions/73490143/how-do-i-include-and-install-test-files-in-a-wheel-and-deploy-to-databricks). It simply doesn't work. Look at the edit history of his/her answer to see how many fixes THE author had to make. In the end it doesn't work. So yeah, on paper it promises the world but in reality it doesn't work. – Kashyap Sep 02 '22 at 13:37