
I am trying to import a module that lives in another repo in Databricks, but the Spark UDF cannot find the module. I can import the module normally on the driver; it only fails inside the PySpark UDF.

I have referenced this Stack Overflow post, but the problem is that our team works on shared clusters and I do not wish to change the cluster environment. The other method we use is to build an egg file, but that process is not conducive to quick iteration and testing, especially on a shared cluster.

The error:

PythonException: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'

During handling of the above exception, another exception occurred:

pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'test'

Where I believe the issue originates (during the withColumn call with the UDF), presumably because the workers cannot resolve `test` when unpickling the function:
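(A minimal sketch of the kind of call I mean; `clean_text`, the column names, and the DataFrame are placeholders rather than my actual code:)

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# module that lives in another Databricks repo; importing it on the driver works
from test import pyspark_utils

# placeholder: wrap one of the module's functions as a UDF
clean_udf = F.udf(pyspark_utils.clean_text, StringType())

# the ModuleNotFoundError above surfaces here, when the UDF is shipped to the
# workers and they fail to unpickle a function that references `test`
df = df.withColumn("cleaned", clean_udf(F.col("raw_text")))
```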

By contrast, I can import the module with no issues on the driver:

from test import pyspark_utils

Basically, I would like to know whether it is possible to import custom modules from Databricks Repos (using Files in Repos) so that PySpark can use them, without building a wheel or egg file and without modifying the shared clusters in ways that may cause conflicts. Thanks for any help or information!

bbl007
    you can package the module and ship it to workers while initializing your spark session -- [how to ship](https://stackoverflow.com/q/24686474/8279585) – samkart Jun 29 '22 at 06:29
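(A sketch of what that comment suggests: zip the repo directory that contains the module and register the archive with addPyFile, so the executors can import it for the current session only. The paths below are placeholders, and this is untested on a shared cluster:)

```python
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder path: the repo directory that contains the `test` package
repo_root = "/Workspace/Repos/<user>/<repo>"

# zip the package so it can be shipped to the executors
archive = shutil.make_archive("/tmp/test_module", "zip", repo_root)

# make the archive importable on every worker for this Spark session only,
# without installing anything cluster-wide
spark.sparkContext.addPyFile(archive)

from test import pyspark_utils  # now resolvable inside UDFs as well
```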
