While using Jupyter Notebook with Anaconda, the function that locates Spark, `find()` in findspark.py, does the following:
```python
import os

def find():
    spark_home = os.environ.get('SPARK_HOME', None)
    if not spark_home:
        for path in [
            '/usr/local/opt/apache-spark/libexec',  # OS X Homebrew
            '/usr/lib/spark/',                      # AWS Amazon EMR
            # Any other common places to look?
        ]:
            if os.path.exists(path):
                spark_home = path
                break
    if not spark_home:
        raise ValueError("Couldn't find Spark, make sure SPARK_HOME env is set"
                         " or Spark is in an expected location (e.g. from homebrew installation).")
    return spark_home
```
So we'll follow this procedure:
1. Specify SPARK_HOME and JAVA_HOME
As we saw in the function above, on Windows we need to specify the locations ourselves. The next function is a slightly modified version of this answer. It is modified because it is also necessary to specify `JAVA_HOME`, the directory where Java is installed. I have also created a spark directory where I moved the downloaded version of Spark that I'm using; for this step you can check out this link.
```python
import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home
    # Use a raw string so backslashes in the Windows path are not
    # interpreted as escape sequences:
    os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jre1.8.0_231'

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If PySpark isn't specified, use currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python

configure_spark(r'C:\spark\spark-2.4.4-bin-hadoop2.6')
```
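To double-check that `configure_spark()` actually exported everything, you can run a quick stdlib-only sanity check. (`check_spark_env` is a hypothetical helper I'm adding here, not part of the answer above.)

```python
import os

def check_spark_env():
    """Return the Spark-related environment variables set by configure_spark()."""
    return {var: os.environ.get(var)
            for var in ('SPARK_HOME', 'JAVA_HOME', 'PYSPARK_PYTHON')}

# After calling configure_spark(...), every entry should be non-empty:
missing = [var for var, value in check_spark_env().items() if not value]
print('Missing variables:', missing or 'none')
```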
2. Configure SparkContext
When working locally, you should configure the SparkContext as follows (this link was useful):
```python
import findspark

# Find Spark locally and add it to sys.path
# (this must happen before importing pyspark):
location = findspark.find()
findspark.init(location, edit_rc=True)

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Start a SparkContext bound to the local loopback address:
configure = SparkConf().set('spark.driver.host', '127.0.0.1')
sc = SparkContext(master='local', appName='desiredName', conf=configure)
```
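As a quick smoke test (my own suggestion, not part of the original answer), you can run a tiny job on the new `sc`. The import guard lets the snippet be defined even on a machine where PySpark isn't importable yet:

```python
try:
    from pyspark import SparkContext  # available once findspark.init() has run
except ImportError:
    SparkContext = None  # PySpark not on sys.path here; skip the real run

def smoke_test(sc):
    """Sum a small RDD to confirm the driver can talk to the local executor."""
    return sc.parallelize([1, 2, 3, 4]).sum()

# With the sc created above, smoke_test(sc) should return 10.
```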
This procedure has worked well for me. Thanks!