
I am new to Spark and trying to use it on Windows. I was able to successfully download and install Spark 1.4.1 using the pre-built version with Hadoop. In the following directory:

/my/spark/directory/bin

I can run spark-shell and pyspark.cmd, and everything works fine. The only problem is that I want to import pyspark while coding in PyCharm. Right now I am using the following code to make things work:

import sys
import os
from operator import add

os.environ['SPARK_HOME'] = r"C:\spark-1.4.1-bin-hadoop2.6"
sys.path.append(r"C:\spark-1.4.1-bin-hadoop2.6\python")
sys.path.append(r"C:\spark-1.4.1-bin-hadoop2.6\python\build")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)

I am wondering if there is an easier way of doing this. I am using Windows 8 with Python 3.4 and Spark 1.4.1.

ahajib

1 Answer


That's about the easiest way I've found. I typically use a function like the following to make things a bit less repetitive.

import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If a Python interpreter isn't specified, use the currently running one:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python

Then, you can call the function before importing pyspark:

configure_spark('/path/to/spark/home')
from pyspark import SparkContext
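
On the asker's setup, a minimal end-to-end sanity check might look like the sketch below. The install path comes from the question; the local[*] master and app name are just illustrative choices, while SparkConf, SparkContext, parallelize, and sum are standard PySpark 1.x API:

configure_spark(r"C:\spark-1.4.1-bin-hadoop2.6")

from pyspark import SparkConf, SparkContext

# Run locally, using as many worker threads as there are logical cores:
conf = SparkConf().setMaster("local[*]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)

# Trivial job to confirm the context is actually working:
print(sc.parallelize(list(range(100))).sum())
sc.stop()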
santon