2

When installing libraries directly in Databricks notebook cells via %pip install, the Python interpreter is restarted. My understanding is that the interpreter must be restarted for the newly installed packages to become visible and accessible to the rest of the notebook cells.

My question: How would it be possible to perform this interpreter restart programmatically?

I am installing packages using a function call, based on requirements stored in a separate file, but I noticed that the newly installed packages are not available, even though the installation seems to succeed, at either notebook or cluster scope. I figured the reason might be that my installation code does not restart the interpreter.

Is such behavior possible?

lazarea
  • 1,129
  • 14
  • 43

2 Answers

3

You can use this:

dbutils.library.restartPython()

For more info, see https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#dbutils-library-restartpython
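For context, a minimal sketch of how this could be combined with installing from a requirements file inside a notebook cell (the DBFS path is just a placeholder, and `dbutils` is assumed to be the object predefined in Databricks notebooks):

import subprocess, sys

# Install the packages listed in a requirements file into the notebook environment.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "-r", "/dbfs/FileStore/requirements.txt"]
)

# Restart the Python interpreter so the newly installed packages become importable
# in the cells that follow. Note: variables defined earlier in the notebook are lost.
dbutils.library.restartPython()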

Akmal Miah
  • 39
  • 2
0

Firstly, you can preinstall packages at the cluster level or job level before you even start the notebook. I would recommend that before you try anything else. Trust me, I've run thousands of libraries, including custom-built ones, and have never needed to do this.

On to your actual question: yes, but it will throw an exception.

This code will kill the Python process on your Databricks (DBX) workspace:

%sh

# Find the PID of the notebook's PythonShell process and force-kill it.
pid=$(ps aux | grep PythonShell | grep -v grep | awk '{print $2}')
kill -9 $pid

However, it poses a problem for you: if you run this as a bash notebook cell, it cannot be wrapped in try/catch logic, which makes automated workflows impossible.

I would have said you can call the shell commands from Python, but that wouldn't work, as the exception would be thrown in that cell. You could perhaps use Scala and the sys.process library to achieve it, but I'm no Scala expert, sadly.

Hope that helps!

Scott Bell
  • 161
  • 2
  • 11
  • That's very insightful @Scott Bell, thank you for the detailed answer. I'm aware I can manually install packages on the cluster, or use pip install directly in a notebook, but neither of these seems to allow me to programmatically install packages by retrieving package names and version numbers from an external file. That's why I had the idea to write a custom function (with the Libraries API, POST requests), but then, because the interpreter was not restarted, the newly installed packages were not (always) accessible to the subsequent cells in the notebook – lazarea May 26 '22 at 05:49
  • 1
    So I can help with that. You should look up cluster [init scripts](https://docs.databricks.com/clusters/init-scripts.html) and this answer [covers that](https://stackoverflow.com/questions/62516102/install-python-packages-using-init-scripts-in-a-databricks-cluster). I have achieved what you want before not using the above, but instead keeping a file in source control and having my CI/CD process call the Databricks API while loading this file (see the sketch after this thread). This is programmatic & source controlled. – Scott Bell May 26 '22 at 08:15
  • 1
    This is an example for devops https://learn.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops – Scott Bell May 26 '22 at 10:57
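For reference, one way the approach Scott describes in the comments could look, as a minimal sketch: read package pins from a requirements file kept in source control and install them on a cluster through the Databricks Libraries API install endpoint. The host, token, cluster id, and file path are placeholders.

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                 # placeholder
CLUSTER_ID = "<cluster-id>"                                       # placeholder

# Build the libraries payload from requirements.txt (lines like "pandas==1.4.2").
with open("requirements.txt") as f:
    packages = [line.strip() for line in f if line.strip() and not line.startswith("#")]
libraries = [{"pypi": {"package": pkg}} for pkg in packages]

# Ask Databricks to install the libraries on the running cluster.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "libraries": libraries},
)
resp.raise_for_status()

Note that libraries installed on a running cluster this way may still not be importable in an already-attached notebook until the Python interpreter is restarted, e.g. with dbutils.library.restartPython() from the first answer.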