Usually, when running a Spark job like this:
spark-submit --master yarn --deploy-mode cluster \
--driver-memory <xxG> \
--conf <all relevant configs> hdfs:///test-numpy.py
We need to have the numpy pip package installed on each node of the Hadoop cluster.
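In other words, today we have to do something like this on every node before submitting (illustrative only):
pip install numpy   # ...and every other pip package our code imports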
Since our code uses many pip packages, we would prefer to bundle all of our dependencies into a single zip file, upload it to HDFS, and use it from there, without having to change anything in the Python code.
How can this be done?
We tried creating a .zip file with virtualenv (the packaging commands are sketched below) and running the job as follows:
spark-submit --master yarn --deploy-mode cluster \
--driver-memory <xxG> \
--conf <all relevant configs> \
--py-files hdfs:///dependencies.zip hdfs:///test-numpy.py
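For completeness, the zip was built roughly along these lines (an illustrative sketch; the exact paths, Python version, and full package list are placeholders):
virtualenv venv
source venv/bin/activate
pip install numpy                        # plus the rest of our requirements
cd venv/lib/python2.7/site-packages     # path depends on the Python version
zip -r ../../../../dependencies.zip .   # zip the installed site-packages
cd -
hdfs dfs -put dependencies.zip /        # so it is reachable as hdfs:///dependencies.zip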
But the job failed with this import error:
ImportError: No module named numpy
Is this the right way to use pip packages?