
I've pip-installed boto3 on my local machine, and then I ran spark-submit in local mode, passing the path to the directory boto3 is installed in, leaving me with the following command:

spark-submit --conf spark.driver.extraClassPath=/Library/Python/2.7/site-packages app.py

Then, when I import boto3 in my app.py, it throws the dreaded module-not-found error.

Is this the correct way to add a pip-installed python dependency to a spark-submit job?

Kristian

1 Answer


The Python that Spark uses is different from the one whose site-packages contains your pip-installed dependencies. In your .bash_profile, set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the Python that has the pip-installed packages:

    export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
    export PYTHONPATH=/usr/local/opt/python/bin/python2.7/:$PYTHONPATH
    export PYSPARK_PYTHON=/usr/local/opt/python/bin/python2.7
    export PYSPARK_DRIVER_PYTHON=/usr/local/opt/python/bin/python2.7
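A quick way to confirm the mismatch (a diagnostic sketch, not part of the original setup) is to print, from inside the submitted script itself, which interpreter is actually running it and whether boto3 is importable there:

```python
import sys

# The interpreter executing this script; if it differs from the one
# where you ran `pip install boto3`, the import below will fail.
print(sys.executable)

# Importable only when this interpreter's site-packages has boto3.
try:
    import boto3  # noqa: F401
    print("boto3 is importable")
except ImportError:
    print("boto3 is NOT importable from", sys.executable)
```

Run it once via `spark-submit` and once via plain `python`; if the two `sys.executable` paths differ, that is the source of the error.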

If the Python on your shell already contains all the pip-installed packages, you can set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to that interpreter's path, which you can find with `which python`.
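For example (a sketch assuming the `python` on your PATH is the interpreter you ran `pip install boto3` with):

```shell
# Locate the interpreter that has the pip-installed packages
which python

# Verify boto3 is importable from that interpreter
python -c "import boto3; print(boto3.__version__)"

# Point both the workers and the driver at it
export PYSPARK_PYTHON="$(which python)"
export PYSPARK_DRIVER_PYTHON="$(which python)"
```

After exporting these, re-run the same `spark-submit` command; the `--conf spark.driver.extraClassPath` flag is not needed for Python packages, since extraClassPath only affects the JVM classpath.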

Derek