
Usually, when running a Spark job like this:

spark-submit --master yarn --deploy-mode cluster \
             --driver-memory <xxG> \
             --conf <all relevant configs> hdfs:///test-numpy.py

we need to have the numpy pip package installed on each node of the Hadoop cluster.
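
For reference, test-numpy.py is essentially just a minimal PySpark job that imports numpy and uses it inside a transformation; the contents below are a simplified sketch, not the real script:

from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("test-numpy").getOrCreate()

# numpy is used inside a transformation, so the import must also succeed on the executors
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: float(np.sqrt(x))).collect())

spark.stop()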

Since our code uses a lot of pip packages, we would prefer to bundle all of our dependencies into a single zip file, upload it to HDFS, and use it from there, without having to change anything in the Python code.

How can this be done? We tried creating a .zip file with virtualenv and running it as follows:

spark-submit --master yarn --deploy-mode cluster \
             --driver-memory <xxG> \
             --conf <all relevant configs> \
             --py-files hdfs:///dependencies.zip hdfs:///test-numpy.py
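
For context, the dependencies.zip was built roughly along these lines (the package list and paths are illustrative, not our exact setup):

virtualenv venv
source venv/bin/activate
pip install numpy                        # plus the rest of our requirements
cd venv/lib/python2.7/site-packages      # path depends on the Python version
zip -r ~/dependencies.zip .              # zip up the installed packages
hdfs dfs -put ~/dependencies.zip /       # upload so it is reachable as hdfs:///dependencies.zip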

But the job failed with this import error:

ImportError: No module named numpy

Is this the right way to use pip packages?

  • Does this answer your question? [I can't seem to get --py-files on Spark to work](https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work) – blackbishop Dec 31 '19 at 15:00
  • Thanks for the link, it seems that it could work, but it requires a python code change, which I prefer to avoid. – RaN Jan 01 '20 at 09:31

0 Answers