
I have the following Python code:

from service import Api
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
...
spark = SparkSession.builder.appName("App Name").enableHiveSupport().getOrCreate()
myApi = Api()

df = spark.sql('SELECT * FROM hive_table')

def map_function(row):
    sql = 'SELECT Name FROM sql_table LIMIT 1'
    result = myApi.executeSQL(sql)
    if int(row[4]) > 100:
        return (result[0][0], row[4])
    else:
        return (row[3], row[4])

schema = StructType([StructField('Name', StringType(), True), StructField('Value', IntegerType(), True)])
rdd_data = df.rdd.map(map_function)
df1 = spark.createDataFrame(rdd_data, schema)
df1.show()

I create a Spark DataFrame and use a map function to iterate over it. In the map function I access a previously defined Api object to query a SQL table.

This code runs successfully in the console and in an Apache Zeppelin notebook, but when I execute it as a script the following error occurs:

ImportError: No module named Api

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

It occurs while accessing the myApi object in the map function. The folder of the service module contains an __init__.py file, so that can't be the problem.

Does anyone have an idea what the problem could be?


1 Answer


If you are running your job through spark-submit, you need to provide the Python files using the --py-files flag. First, create a .zip file with all the dependencies:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

and then pass the archive using --py-files:

spark-submit --py-files dependencies.zip your_spark_job.py
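
If service is a single local module rather than a set of pip-installed packages, --py-files also accepts plain .py files directly, so (a hedged variant, assuming service.py sits next to your script) the zip step can be skipped:

spark-submit --py-files service.py your_spark_job.py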

Finally, inside your Spark job's script, add the following line:

sc.addPyFile("dependencies.zip")
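
Note that the script in the question only creates a SparkSession, not a bare sc; the SparkContext is reachable through it. A minimal sketch, assuming the setup from the question:

spark = SparkSession.builder.appName("App Name").enableHiveSupport().getOrCreate()
spark.sparkContext.addPyFile("dependencies.zip")  # ship the archive to every executor
from service import Api  # import only after the archive has been distributed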

Alternatively, if you are using a Jupyter Notebook, all you have to do is append the module's directory to PYTHONPATH (PYTHONPATH entries are directories, so point it at the folder that contains service.py, not at the file itself):

export PYTHONPATH="${PYTHONPATH}:/path/to/your"
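
Equivalently, you can extend sys.path from inside the notebook itself before the import. A minimal sketch, assuming a hypothetical /path/to/your directory that contains service.py:

import sys
sys.path.append('/path/to/your')  # hypothetical directory containing service.py
from service import Api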