Azure databricks PySpark custom UDF ModuleNotFoundError: No module named

Question

I was checking this SO but none of the solutions helped PySpark custom UDF ModuleNotFoundError: No module named

I have the current repo on azure databricks:

|-run_pipeline.py
|-__init__.py
|-data_science
|--__init.py__
|--text_cleaning
|---text_cleaning.py
|---__init.py__

On the run_pipeline notebook I have this

from data_science.text_cleaning import text_cleaning
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)
spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()

df = text_cleaning.basic_clean(spark_df)

On the text_cleaning.py I have a function called basic_clean that will run something like this:

 def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df

When I do df.show() on the run_pipeline notebook, I get this error message:

Exception has occurred: PythonException       (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'

Shouldnt the imports work? Why is this an issue?

score 0 · Answer 1 · answered Feb 02 '23 at 09:10

I've been facing the same issue running pyspark tests with UDFs in Azure Devops. I've noticed that this happens when running from the pool with vmImage:ubuntu-latest. When I use a custom container build from the following Dockerfile, the tests run fine:

FROM python:3.8.3-slim-buster AS py3
FROM openjdk:8-slim-buster

ENV PYSPARK_VER=3.3.0
ENV DELTASPARK_VER=2.1.0

COPY --from=py3 / /

WORKDIR /setup

COPY requirements.txt .

RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt && \
  rm requirements.txt

WORKDIR /code

requirements.txt contains pyspark==3.3.0 and delta-spark==2.1.0.

This led me to conclude that it's due to how spark runs in the default ubuntu VM which runs python 3.10.6 and java 11 (at the time of posting this). I've tried setting env variables such as PYSPARK_PYTHON to enforce pyspark to use the same python binary on which the to-be-tested package is installed but to no avail.

Maybe you can use this information to find a way to use the default agent pool's ubuntu vm to get it to work, otherwise I recommend just using a pre-configured container like I did.

ShaikMaheer · Answer 2 · 2022-12-05T06:04:25.420

It seems data-science module is missing on cluster. Kindly consider installing it on cluster. Please check below link about installing libraries to cluster. https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries

You can consider executing pip list command to see libraries installed on cluster.

You can consider running pip install data_science command also directly in notebook cell.

Azure databricks PySpark custom UDF ModuleNotFoundError: No module named

2 Answers2