I'm trying to do a basic dict transformation in Python with Spark. The sample code is shown below.
from pyspark import SparkContext, SparkConf
import sys

sys.path.insert(0, 'path_to_myModule')
from spark_test import make_generic  # the function that transforms the dict

data = [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}]

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

def testing(data):
    data['a'] = data['a'] + 1
    data['b'] = data['b'] + 2
    return data

rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(lambda x: testing(x))
print(rdd2.collect())

rdd3 = rdd1.map(lambda x: make_generic(x))  # does a similar job to testing()
print(rdd3.collect())
The path to the module is inserted into sys.path, yet I still get the error below.
Traceback (most recent call last):
File "/home/roshan/sample_spark.py", line 5, in <module>
from spark_test import make_generic
ImportError: No module named 'spark_test'
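From what I've read, inserting the path into sys.path only affects the driver process, and the executors still need a copy of the module when they deserialize the map() function. This is my rough guess at a fix using sc.addPyFile (the file path is just a placeholder for wherever spark_test.py actually lives), though I'm not sure it's the right approach:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)

# Placeholder path: ship spark_test.py to the executors before importing it
sc.addPyFile('path_to_myModule/spark_test.py')

from spark_test import make_generic

data = [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}]
print(sc.parallelize(data).map(make_generic).collect())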
Also, the make_generic() function depends on a few packages that are installed in a virtualenv.
All in all, I need help with two things: 1. getting Spark to import the module successfully, and 2. using the virtualenv to run the spark-submit job. My rough guess at the spark-submit call is below.
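This is roughly the invocation I imagine, with --py-files shipping the module and PYSPARK_PYTHON pointing at the virtualenv's interpreter; the virtualenv path is a placeholder and I haven't verified that this works:

# Placeholder paths; my guess at how the two pieces fit together
PYSPARK_PYTHON=/path/to/my_virtualenv/bin/python \
spark-submit \
  --master local \
  --py-files path_to_myModule/spark_test.py \
  /home/roshan/sample_spark.py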