PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL: In my local environment I have setup a virtualenv that includes numpy as well as a private repo I use in my project and other various libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that on the remote nodes there is a /usr/local/bin/python2.7 folder that includes a python 2.7 install.
so in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I screen print the installed_distributions I get a zero length list []. Also my private library imports correctly (which says to me it is actually accessing my libs.zip site-packages.). My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS: I expect this to import numpy correctly especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of python that is installed in my virtualenv on the remote node. My question is first, how do I fix this and second how do I use my virtualenv installed python on the remote nodes instead of the python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.