
I am saving a dataframe to a CSV file in PySpark using the statement below:

df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')

But I am getting the error below:

    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
        f, return_type = read_command(pickleSer, infile)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
        command = serializer._read_with_length(file)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
        return self.loads(obj)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'app'

I am using PySpark version 2.3.0.

I am getting this error while trying to write to a file.

    import json, jsonschema
    from pyspark.sql import functions
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType, FloatType
    from datetime import datetime
    import os

    feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
    apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

    df_all = feb.union(apr)
    df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

    create_emi_amount_udf = udf(create_emi_amount, FloatType())
    df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

    df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
ankit

2 Answers


The error is clear: there is no module 'app' on the executors. Your Python code runs on the driver, but your UDF runs on the executor's Python VM. When you call the UDF, Spark serializes create_emi_amount and sends it to the executors.

So, somewhere in your create_emi_amount method you use or import the app module. A solution to your problem is to use the same environment on both the driver and the executors. In spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
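For example, a minimal sketch of what spark-env.sh could look like, assuming a shared virtualenv at /opt/venvs/project (a placeholder path) that actually contains the 'app' module and your other dependencies:

    # spark-env.sh -- point driver and executors at the same interpreter
    # so they unpickle the UDF with identical modules available.
    # /opt/venvs/project is a hypothetical path; use your real virtualenv.
    export PYSPARK_DRIVER_PYTHON=/opt/venvs/project/bin/python
    export PYSPARK_PYTHON=/opt/venvs/project/bin/python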

ggeop

Thanks to ggeop! He helped me out and explained the problem. But the solution may not be correct if 'app' is your own package.

My solution is to add the file to the SparkContext:

sc = SparkContext()
sc.addPyFile("app.zip")

But you have to zip the app package first, and you have to make sure the zip archive contains the app directory itself.
That is, if your app is at /home/workplace/app, you have to create the zip from inside workplace, which will zip everything under workplace, including app.
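A minimal sketch of that idea, assuming the layout above (/home/workplace/app) and building the archive with Python's shutil before creating the context:

    import shutil
    from pyspark import SparkContext

    # Zip from the parent directory so the archive keeps app/ as its
    # top-level entry; /home/workplace and app.zip are placeholders.
    shutil.make_archive("app", "zip", root_dir="/home/workplace", base_dir="app")

    sc = SparkContext()
    sc.addPyFile("app.zip")  # executors can now resolve `import app`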

The other way is to ship the file with spark-submit, as below:

--py-files app.zip
--py-files myapp.py
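For example (my_driver.py is a placeholder for your actual driver script):

    spark-submit --py-files app.zip my_driver.py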
ZygD