
PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:

ImportError: 
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: cannot import name multiarray

DETAIL: In my local environment I have set up a virtualenv that includes numpy, a private repo I use in my project, and various other libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages, where 'venv' is my virtual environment, and I ship this zip to the remote nodes (roughly how I build the zip is shown after the submit script below). My shell script for performing the spark-submit looks like this:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.pyspark.virtualenv.enabled=true  \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
  --py-files "${parent}/lib/libs.zip" \
  --num-executors 1 \
  --executor-cores 2 \
  --executor-memory 2G \
  --driver-memory 2G \
  $parent/src/features/pi.py
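
For reference, I build libs.zip roughly like this (paths assumed from the layout described above):

cd ${parent}/venv/lib/site-packages
zip -r ${parent}/lib/libs.zip .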

I also know that the remote nodes have a Python 2.7 install available at /usr/local/bin/python2.7.

So in my conf/spark-env.sh I have set the following:

export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7

When I run the script I get the error above. If I print the installed_distributions I get an empty list, []. My private library imports correctly, which tells me the job is actually accessing the site-packages in my libs.zip. My pi.py file looks something like this:

from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()

EXPECTATION/MY THOUGHTS: I expect this to import numpy correctly, especially since I know numpy works in my local virtualenv. I suspect the problem is that I am not actually using the Python installed in my virtualenv on the remote nodes. My questions: first, how do I fix this, and second, how do I use the Python from my virtualenv on the remote nodes instead of the Python that is manually installed and currently sitting on those machines? I've seen some write-ups on this, but frankly they are not well written.

AntiPawn79

1 Answer


With the --conf spark.pyspark.* settings and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (the executors), use the following syntax:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
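
For example, added to the spark-submit call above (keeping the interpreter path from your spark-env.sh; adjust it to whatever actually exists on the nodes):

  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \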

Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
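
A rough sketch, assuming an older virtualenv release that still supports the flag:

virtualenv --relocatable venv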

What we did in such cases: we shipped an entire Anaconda distribution over HDFS.

<edit 20170908>

If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create a 'portable' Anaconda, i.e. install Anaconda in a Linux VM and zip it.
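
Roughly, as a sketch (the install location and HDFS destination below are assumptions):

# on a Linux VM where Anaconda2 is installed under /opt/anaconda2
cd /opt
zip -r anaconda2.zip anaconda2/
# ship the zip to the cluster, e.g. via HDFS
hdfs dfs -put anaconda2.zip /user/me/anaconda2.zip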

Regarding --archives vs. --py-files:

  • --py-files adds python files/packages to the python path. From the spark-submit documentation:

    For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

  • --archives means the archives are extracted into the working directory of each executor (YARN clusters only).

However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.

In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
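
As a sketch (the archive name, the #anaconda alias and the interpreter path inside the extracted archive are assumptions that depend on how you zipped Anaconda):

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --archives hdfs:///user/me/anaconda2.zip#anaconda \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda/anaconda2/bin/python \
  --py-files "${parent}/lib/libs.zip" \
  ${parent}/src/features/pi.py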

</edit>

See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.

akoeltringer
  • To set the `PYSPARK_PYTHON` for executors, you should probably use `spark.executorEnv.PYSPARK_PYTHON` rather than `spark.yarn.appMasterEnv.PYSPARK_PYTHON`. According to the Spark documentation, `spark.yarn.appMasterEnv` actually manages the environment of the driver when running in `yarn cluster` mode, and only the environment of executor LAUNCHERS when running in `yarn client` mode. – Jakub Kukul Nov 29 '18 at 22:26
  • @jkukul A note in the Spark docs states that `When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property [...]`. The OP asked about exactly this case (`yarn` in `cluster` mode). This certainly also works for the executors: we (sadly) only have Python 2.7 natively on the cluster but distribute Python 3.6 (Anaconda) that way; Spark would not work with different Python versions on the driver and executors. – akoeltringer Dec 17 '18 at 09:23