
I have the following Python code:

from service import Api
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
...
spark = SparkSession.builder.appName("App Name").enableHiveSupport().getOrCreate()
myApi = Api()

df = spark.sql('SELECT * FROM hive_table')

def map_function(row):
    sql = 'SELECT Name FROM sql_table LIMIT 1'
    result = myApi.executeSQL(sql)
    if int(row[4]) > 100:
        return (result[0][0], row[4])
    else:
        return (row[3], row[4])

schema = StructType([StructField('Name', StringType(), True), StructField('Value', IntegerType(), True)])
rdd_data = df.rdd.map(map_function)
df1 = spark.createDataFrame(rdd_data, schema)
df1.show()

I create a Spark DataFrame and use a map function to iterate over it. In the map function I access a previously defined Api object to query a SQL table.

This code runs successfully in the console and in an Apache Zeppelin notebook, but when I execute it as a script the following error occurs:

ImportError: No module named Api

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:470)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:453)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

It occurs while accessing the myApi object in the map function. The folder of the service module contains an __init__.py file, so that can't be the problem.

Does anyone have an idea what the problem could be?


1 Answer


If you are running your job through spark-submit, you need to provide the Python files using the --py-files flag. First, create a .zip file with all the dependencies:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

and then pass the archive using --py-files:

spark-submit --py-files dependencies.zip your_spark_job.py
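
If service is a single local module rather than a set of pip-installed packages, --py-files also accepts plain .py files directly, so (a hedged variant, assuming service.py sits next to your script) the zip step can be skipped:

spark-submit --py-files service.py your_spark_job.py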

Finally, inside your Spark job's script, add the following line:

sc.addPyFile("dependencies.zip")
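
Note that the script in the question only creates a SparkSession, not a bare sc; the SparkContext is reachable through it. A minimal sketch, assuming the setup from the question:

spark = SparkSession.builder.appName("App Name").enableHiveSupport().getOrCreate()
spark.sparkContext.addPyFile("dependencies.zip")  # ship the archive to every executor
from service import Api  # import only after the archive has been distributed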

Alternatively, if you are using a Jupyter Notebook, all you have to do is append the module's directory to PYTHONPATH (PYTHONPATH entries are directories, so point it at the folder that contains service.py, not at the file itself):

export PYTHONPATH="${PYTHONPATH}:/path/to/your"
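
Equivalently, you can extend sys.path from inside the notebook itself before the import. A minimal sketch, assuming a hypothetical /path/to/your directory that contains service.py:

import sys
sys.path.append('/path/to/your')  # hypothetical directory containing service.py
from service import Api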