
Usually, when running a Spark job like this:

spark-submit --master yarn --deploy-mode cluster \
             --driver-memory <xxG> \
             --conf <all relevant configs> hdfs:///test-numpy.py

we need to have the numpy pip package installed on each node of the Hadoop cluster.
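
For reference, test-numpy.py is essentially just a minimal PySpark job that imports numpy and uses it inside a transformation; the contents below are a simplified sketch, not the real script:

from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("test-numpy").getOrCreate()

# numpy is used inside a transformation, so the import must also succeed on the executors
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: float(np.sqrt(x))).collect())

spark.stop()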

Since our code uses a lot of pip packages, we would prefer to bundle all of our dependencies into a single zip file, upload it to HDFS, and use it from there, without having to change anything in the Python code.

How can this be done? We tried creating a .zip file with virtualenv and running it as follows:

spark-submit --master yarn --deploy-mode cluster \
             --driver-memory <xxG> \
             --conf <all relevant configs> \
             --py-files hdfs:///dependencies.zip hdfs:///test-numpy.py
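
For context, the dependencies.zip was built roughly along these lines (the package list and paths are illustrative, not our exact setup):

virtualenv venv
source venv/bin/activate
pip install numpy                        # plus the rest of our requirements
cd venv/lib/python2.7/site-packages      # path depends on the Python version
zip -r ~/dependencies.zip .              # zip up the installed packages
hdfs dfs -put ~/dependencies.zip /       # upload so it is reachable as hdfs:///dependencies.zip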

But the job failed with this import error:

ImportError: No module named numpy

Is this the right way to use pip packages?

  • Does this answer your question? [I can't seem to get --py-files on Spark to work](https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work) – blackbishop Dec 31 '19 at 15:00
  • Thanks for the link, it seems that it could work, but it requires a python code change, which I prefer to avoid. – RaN Jan 01 '20 at 09:31

0 Answers