-1

Normally we do a spark-submit with the zip file:

spark-submit --name App_Name --master yarn --deploy-mode cluster --archives /<path>/myzip.zip#pyzip /<path>/Processfile.py

and access the modules in the .py files using

from dir1.dir2.dir3.module_name import module_name

and the import works fine.
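
For that dotted import to resolve, the archive presumably carries the package hierarchy at its top level, along these lines (directory names as in the import, each package directory with an __init__.py):

myzip.zip
    dir1/__init__.py
    dir1/dir2/__init__.py
    dir1/dir2/dir3/__init__.py
    dir1/dir2/dir3/module_name.py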

When I try to do the same in the pyspark shell, it gives me a "module not found" error:

pyspark --py-files /<path>/myzip.zip#pyzip

How can the modules be accessed in the pyspark shell?

DataWrangler

2 Answers

2

You can use the SparkContext that is available in the pyspark shell through the 'spark' SparkSession variable, as follows:

spark.sparkContext.addPyFile('Path to your file')

As per the Spark docs, a .py or .zip dependency containing Python code is supported:

 |  addPyFile(self, path)
 |      Add a .py or .zip dependency for all tasks to be executed on this
 |      SparkContext in the future.  The C{path} passed can be either a local
 |      file, a file in HDFS (or other Hadoop-supported filesystems), or an
 |      HTTP, HTTPS or FTP URI.
 |
 |      .. note:: A path can be added only once. Subsequent additions of the same path are ignored.
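
The path does not have to be local; a hypothetical HDFS location works as well (the URI below and the zip layout it implies are assumptions, mirroring the package-style import from the question):

>>> # hypothetical HDFS path; the zip's root must contain the dir1 package
>>> spark.sparkContext.addPyFile('hdfs:///tmp/myzip.zip')
>>> from dir1.dir2.dir3.module_name import module_name

Besides shipping the file to the executors, addPyFile also puts it on the driver's sys.path, which is why the import works directly in the shell.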

Below is a successful import and function call after adding the zip:

>>> sc.addPyFile('D:\pyspark_test.zip')
>>> import test
>>> test
<module 'test' from 'C:\\Users\\AppData\\Local\\Temp\\spark-f4559ba6-0661-4cea-a841-55d7550d809d\\userFiles-062f5965-e5df-4d26-b2cd-daf7613df56a\\pyspark_test.zip\\test.py'>
>>> test.print_data()
hello
>>>

Make sure the zip file has the structure shown below. When creating the zip, select all the individual files in the module and zip those, rather than selecting the module folder and zipping the folder.

└───pyspark_test
        test.py
        __init__.py
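
If you prefer to script the packaging, here is a minimal sketch using Python's zipfile module, assuming the folder layout above (adjust the paths to your setup):

import zipfile

# Put the module files at the root of the archive, i.e. zip the files inside
# pyspark_test/ rather than the pyspark_test folder itself.
with zipfile.ZipFile('pyspark_test.zip', 'w') as zf:
    zf.write('pyspark_test/test.py', arcname='test.py')
    zf.write('pyspark_test/__init__.py', arcname='__init__.py')

With the files at the archive root, 'import test' resolves exactly as in the transcript above.
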
  • I did try this https://stackoverflow.com/a/39779271/6304472 but I was still getting the module-not-found error. Can you please add more info on how to access the modules? Some working code I can test? – DataWrangler Jan 02 '20 at 14:03
  • `/home/user/package/local/lib/python3.6/dist-packages/common/process/filename.py` Under the common folder I have the files below (`__init__.py __pycache__`); similarly the sub-directory process has them along with the filename.py file. Even after following the process, the issue persists: `>>> sc.addPyFile('//myzip.zip')` `>>> import filename` `Traceback (most recent call last):` `File "<stdin>", line 1, in <module>` `ModuleNotFoundError: No module named 'filename'` – DataWrangler Jan 03 '20 at 06:32
  • `from common.process.filename import filename` tried this too but still no luck – DataWrangler Jan 03 '20 at 06:36
  • Select the files filename.py and __init__.py and create a zip out of them, myzip.zip. – kirtan_shah Jan 03 '20 at 06:38
  • Please paste the complete error after ModuleNotFoundError: – kirtan_shah Jan 03 '20 at 06:38
  • It's the complete error message that I got in the Pyspark shell – DataWrangler Jan 03 '20 at 06:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/205277/discussion-between-kirtan-shah-and-joby). – kirtan_shah Jan 03 '20 at 06:41
0

I was finally able to import the modules in the Pyspark shell. The ZIP I am passing contains all the dependency modules installed into a Python virtual environment and then zipped up.

So in such cases, activating the virtual environment and then starting the Pyspark shell did the trick:

source bin/activate                        # activate the virtualenv that holds the dependencies
pyspark --archives <path>/filename.zip     # distribute the zipped environment to the executors

This didn't require adding the py-files to the SparkContext either.
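
A quick sanity check from inside that shell is to import one of the bundled packages and look at where it resolves from (numpy here is only a hypothetical example of a dependency installed in the virtualenv):

>>> import numpy        # hypothetical dependency bundled in the virtual environment
>>> numpy.__file__      # should point inside the environment, not the system Python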

DataWrangler