
I am trying to import a module that lives in another repo in Databricks, but the Spark UDF cannot find the module. I can import the module normally on the driver; it only fails inside the PySpark UDF.

I have referenced this Stack Overflow post, but the problem is that our team works on shared clusters and I do not wish to change the cluster environment. The other method we use is to build an egg file, but that process is not conducive to quick iteration and testing, especially on a shared cluster.

The error:

PythonException: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'

During handling of the above exception, another exception occurred:

pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'

Where I believe the issue originates (during the withColumn call with the UDF), presumably because the workers cannot resolve `test` when unpickling the function:
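(A minimal sketch of the kind of call I mean; `clean_text`, the column names, and the DataFrame are placeholders rather than my actual code:)

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# module that lives in another Databricks repo; importing it on the driver works
from test import pyspark_utils

# placeholder: wrap one of the module's functions as a UDF
clean_udf = F.udf(pyspark_utils.clean_text, StringType())

# the ModuleNotFoundError above surfaces here, when the UDF is shipped to the
# workers and they fail to unpickle a function that references `test`
df = df.withColumn("cleaned", clean_udf(F.col("raw_text")))
```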

By contrast, I can import the module with no issues on the driver:

from test import pyspark_utils

Basically, I would like to know whether it is possible to import custom modules from Databricks Repos (using Files in Repos) so that PySpark can use them, without building a wheel or egg file and without modifying the shared clusters in ways that may cause conflicts. Thanks for any help or information!

bbl007
    you can package the module and ship it to workers while initializing your spark session -- [how to ship](https://stackoverflow.com/q/24686474/8279585) – samkart Jun 29 '22 at 06:29
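(A sketch of what that comment suggests: zip the repo directory that contains the module and register the archive with addPyFile, so the executors can import it for the current session only. The paths below are placeholders, and this is untested on a shared cluster:)

```python
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder path: the repo directory that contains the `test` package
repo_root = "/Workspace/Repos/<user>/<repo>"

# zip the package so it can be shipped to the executors
archive = shutil.make_archive("/tmp/test_module", "zip", repo_root)

# make the archive importable on every worker for this Spark session only,
# without installing anything cluster-wide
spark.sparkContext.addPyFile(archive)

from test import pyspark_utils  # now resolvable inside UDFs as well
```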
