I'm trying to do a basic dict transformation in Python with Spark. The sample code is shown below.
from pyspark import SparkContext, SparkConf
import sys

sys.path.insert(0, 'path_to_myModule')
from spark_test import make_generic  # the function that transforms the dict

data = [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}]

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

def testing(data):
    data['a'] = data['a'] + 1
    data['b'] = data['b'] + 2
    return data

rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(lambda x: testing(x))
print(rdd2.collect())

rdd3 = rdd1.map(lambda x: make_generic(x))  # does a similar job to testing()
print(rdd3.collect())
The path to the module is inserted into sys.path, yet I still get the error below.
Traceback (most recent call last):
File "/home/roshan/sample_spark.py", line 5, in <module>
from spark_test import make_generic
ImportError: No module named 'spark_test'
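From what I've read, inserting the path into sys.path only affects the driver process, and the executors still need a copy of the module when they deserialize the map() function. This is my rough guess at a fix using sc.addPyFile (the file path is just a placeholder for wherever spark_test.py actually lives), though I'm not sure it's the right approach:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)

# Placeholder path: ship spark_test.py to the executors before importing it
sc.addPyFile('path_to_myModule/spark_test.py')

from spark_test import make_generic

data = [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}]
print(sc.parallelize(data).map(make_generic).collect())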
Also, the make_generic() function depends on a few packages that are installed in a virtualenv.
All in all, I need help with two things: 1. getting Spark to import the module successfully, and 2. using the virtualenv to run the spark-submit job. My rough guess at the spark-submit call is below.
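This is roughly the invocation I imagine, with --py-files shipping the module and PYSPARK_PYTHON pointing at the virtualenv's interpreter; the virtualenv path is a placeholder and I haven't verified that this works:

# Placeholder paths; my guess at how the two pieces fit together
PYSPARK_PYTHON=/path/to/my_virtualenv/bin/python \
spark-submit \
  --master local \
  --py-files path_to_myModule/spark_test.py \
  /home/roshan/sample_spark.py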