
I have a question regarding the correct way to install new packages on Spark worker nodes, using Databricks and MLflow.

What I currently have is the following:

  1. a training script (using cv2, i.e. the opencv-python library), which logs the tuned ML model, together with its dependencies, to the MLflow Model Registry
  2. an inference script which reads the logged ML model, together with the saved conda environment, as a spark_udf
  3. an installation step which reads the conda environment and installs all packages at the required versions, via pip install (wrapped in a subprocess.call)
  4. and then the actual prediction, where the spark_udf is called on new data (see the sketch right after this list).
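Here is a condensed sketch of what steps 2–4 look like in my inference notebook. The model name, stage, and inference_df are placeholders, and the artifact-download call is only an approximation of my actual installation step:

import os
import subprocess
import sys

import mlflow.pyfunc
import yaml
from mlflow.artifacts import download_artifacts

# 2. load the registered model back as a Spark UDF (name/stage are placeholders)
model_uri = "models:/my_cv_model/Production"
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

# 3. read the logged conda environment and pip-install its pip dependencies;
#    note that this subprocess runs on the driver node only
local_model_dir = download_artifacts(artifact_uri=model_uri)
with open(os.path.join(local_model_dir, "conda.yaml")) as f:
    conda_env = yaml.safe_load(f)
pip_deps = next(d["pip"] for d in conda_env["dependencies"] if isinstance(d, dict))
subprocess.call([sys.executable, "-m", "pip", "install", *pip_deps])

# 4. call the spark_udf on new data -- this is the step that fails on the executors
scored_df = inference_df.withColumn("prediction", predict_udf(*inference_df.columns))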

The last step is what fails with ModuleNotFoundError.

SparkException: Job aborted due to stage failure: Task 8 in stage 95.0
failed 4 times, most recent failure: Lost task 8.3 in stage 95.0 (TID 577)
(xxx.xxx.xxx.xxx executor 0): org.apache.spark.api.python.PythonException:
'ModuleNotFoundError: No module named 'cv2''

I have been closely following this article, which seems to cover the same problem:

  • "One major reason for such issues is using udfs. And sometimes the udfs don’t get distributed to the cluster worker nodes."
  • "The respective dependency modules used in the udfs or in main spark program might be missing or inaccessible from\in the cluster worker nodes."

So it seems that at the moment, despite using spark_udf and the conda environment logged to MLflow, the installation of the cv2 module only happened on my driver node, but not on the worker nodes. If this is true, I now need to programmatically specify these extra dependencies (namely, the cv2 Python module) to the executors/worker nodes.

So what I did was import cv2 in my inference script, retrieve the path of cv2's __init__ file, and add it to the Spark context, similarly to how it is done for the arbitrary "A.py" file in the blog post above.

import os
import cv2

# cv2.__file__ points at the package's __init__.py; ship that file to the executors
spark.sparkContext.addFile(os.path.abspath(cv2.__file__))

This doesn't seem to change anything, though. I assume the reason is, in part, that I need to make the entire cv2 library accessible to the worker nodes, not just a single __init__.py file, whereas the call above only ships the __init__.py. I'm certain that adding every file of every cv2 submodule one by one is not the way to go either, but I haven't been able to figure out how to achieve this easily with a command similar to the addFile() above (the closest I could come up with is the zip-based sketch further below).

Similarly, I also tried the other option, addPyFile(), pointing it at the cv2 package's root directory (the parent of __init__.py):

import os
import cv2

# point addPyFile() at the directory that contains cv2's __init__.py
spark.sparkContext.addPyFile(os.path.dirname(cv2.__file__))

but this also didn't help, and I still got stuck with the same error. Furthermore, I would like this process to be automatic, i.e. not having to manually set module paths in the inference code.
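For completeness, the closest "ship the whole package" variant I could come up with is zipping the package directory and adding the archive via addPyFile(), which is a known approach for pure-Python dependencies. A minimal sketch, with a hypothetical pure-Python package name; I understand this would not cover cv2 anyway, since its wheels contain compiled extensions:

import os
import shutil

import some_pure_python_pkg  # hypothetical pure-Python dependency

# locate the installed package directory
pkg_dir = os.path.dirname(some_pure_python_pkg.__file__)

# zip it so that the top-level package directory sits at the archive root,
# then ship the archive; the executors add it to their sys.path
zip_path = shutil.make_archive(
    "/tmp/some_pure_python_pkg",
    "zip",
    root_dir=os.path.dirname(pkg_dir),
    base_dir=os.path.basename(pkg_dir),
)
spark.sparkContext.addPyFile(zip_path)

Even if this worked, it would mean hard-coding package names in the inference code, which is exactly what I want to avoid.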

Similar posts I came across:

  • does inference happen on the "normal" cluster, or on a model serving cluster? – Alex Ott May 30 '22 at 18:06
  • There are two clusters involved; the difference is just one package. Training uses a cluster which has cv2 installed, while inference uses a cluster without cv2, because I want to retrieve that library directly from the logs. – lazarea May 31 '22 at 07:05
  • but why not simply install `cv2` on the inference cluster? – Alex Ott May 31 '22 at 07:10
  • Note that this is just a mock example. In reality, there are way more dependencies (that are not included in the Databricks ML Runtime), and inference cluster should not have direct access to what packages the original model is using. Mlflow provides this isolation and allows the logging of dependencies - so it seems logical to rely on this isolation capability and not individually reinstall every dependency. – lazarea May 31 '22 at 07:42
