
I'm trying to configure apache-spark on macOS. All the online guides say to either download the Spark tar and set some env variables, or to run brew install apache-spark and then set some env variables.

I installed apache-spark using brew install apache-spark. When I run pyspark in the terminal I get a Python prompt, which suggests the installation was successful.

But when I try to import pyspark in my Python file, I get an error: ImportError: No module named pyspark

The strangest part is that it can start a pyspark REPL but cannot import the module in Python code.

I also tried pip install pyspark, but the module is still not recognized.

In addition to installing apache-spark with Homebrew, I've set up the following env variables.

if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
  export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi

Please suggest what exactly is missing in my setup to run pyspark code on my local machine.

Keyur Golani

2 Answers


The pyspark module is not included in your Python path.

Try this instead:

import os
import sys

# point Spark at the Homebrew installation
os.environ['SPARK_HOME'] = "/usr/local/Cellar/apache-spark/2.1.0/libexec/"

# make the bundled pyspark and py4j packages importable
sys.path.append("/usr/local/Cellar/apache-spark/2.1.0/libexec/python")
sys.path.append("/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

# local[*] uses all available cores; 'PySpark' is just the application name
sc = SparkContext('local[*]', 'PySpark')
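
Once the imports succeed, a quick smoke test confirms the context actually works (just an illustrative local job; the numbers here are only an example):

# run a tiny local job and release the context when done
print(sc.parallelize(list(range(100))).map(lambda x: x * 2).sum())  # expect 9900
sc.stop()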

If you don't want to do that in every script, add the paths to your environment instead, and don't forget the Python path.

export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec/
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
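
With those exports in place (for example added to your ~/.bash_profile so they survive new terminal sessions), a plain Python interpreter should find the module. A minimal check, assuming the same Spark 2.1.0 Homebrew paths as above (the app name below is arbitrary):

import pyspark
print(pyspark.__file__)   # should point under .../apache-spark/2.1.0/libexec/python

from pyspark import SparkContext
sc = SparkContext('local[2]', 'env-check')
print(sc.version)         # should print 2.1.0
sc.stop()
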
ling7334
  • This approach did work. However, I am looking for a more permanent solution where I don't have to write the three lines every time. I also tried configuring the same three things in the OS environment, but with that approach it is still not able to detect the pyspark module. – Keyur Golani Jan 23 '17 at 18:14

Sorry, I don't use a Mac, but besides the answer above there is another way on Linux:

sudo ln -s $SPARK_HOME/python/pyspark /usr/local/lib/python2.7/site-packages

Python will pick up the module from /path/to/your/python/site-packages, which is searched last on sys.path.
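
To confirm the symlinked package is the one Python picks up (and to catch the py4j issue mentioned in the comments below), a quick check like this should work:

import pyspark
print(pyspark.__file__)   # should resolve through the site-packages symlink

import py4j               # fails with ImportError unless py4j is also installed or on PYTHONPATH
print(py4j.__file__)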

Zhang Tong
  • This kind of worked, but then it doesn't find the py4j protocol and gives the following error: ```No module named py4j.protocol```. However, it did resolve the error of not finding the ```pyspark``` module. – Keyur Golani Jan 24 '17 at 05:43
  • @KeyurGolani just pip install py4j – Zhang Tong Jan 24 '17 at 07:08
  • or find it in $SPARK_HOME/python/lib/py4j-0.10.4-src.zip and unzip it to site-packages – Zhang Tong Jan 24 '17 at 07:16
  • Ok. ```pip install py4j``` was giving an installation error. However, I used ```easy_install py4j``` from http://stackoverflow.com/a/33463702/3078330 and it worked fine. Just had to set ```export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec/ export PATH=$PATH:/usr/local/Cellar/apache-spark/2.1.0/libexec/python export PATH=$PATH:/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip``` in addition. Thanks a lot @zhangtong – Keyur Golani Jan 24 '17 at 18:20