
I know this question has been posted before, but I tried implementing the suggested solutions and none of them worked for me. I installed Spark for Jupyter Notebook using this tutorial:

https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b#.be80dcqat

I installed the latest version of Apache Spark on my Mac.

When I try to run the following code in Jupyter

wordcounts = sc.textFile('words.txt')

I get the following error:

name 'sc' is not defined

When I try adding this code:

from pyspark import SparkContext, SparkConf
sc = SparkContext()

I get the following error:

An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.util.StringUtils
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)

I added the following to my bash profile:

export SPARK_PATH=~/spark-2.2.1-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# For Python 3, you have to add the line below or you will get an error
# export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'

Please help me resolve this.

  • How do you set SPARK_PATH? Could you provide the full URL? – george Jan 18 '18 at 22:19
  • I added the following to the bash export SPARK_PATH=~/spark-2.2.1-bin-hadoop2.7 export PYSPARK_DRIVER_PYTHON="jupyter" export PYSPARK_DRIVER_PYTHON_OPTS="notebook" #For python 3, You have to add the line below or you will get an error # export PYSPARK_PYTHON=python3 alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]' – Susha Suresh Jan 18 '18 at 22:24
  • Which version of Java do you have? Can you type java -version and share the result? – george Jan 18 '18 at 22:42
  • `PYSPARK_DRIVER_PYTHON="jupyter"` is a really crappy solution, and it should be avoided: https://stackoverflow.com/questions/47824131/configuring-spark-to-work-with-jupyter-notebook-and-anaconda/47870277#47870277 – desertnaut Jan 18 '18 at 23:15
  • Sorry for the late reply. Version is 9 – Susha Suresh Jan 19 '18 at 02:31

1 Answer


These steps solved my problem (local PySpark with Jupyter Notebook setup on Windows).

The error I saw in Jupyter Notebook (screenshot not reproduced here).


  1. Download and install Java 8: https://www.oracle.com/java/technologies/downloads/#java8-windows

  2. Download spark-3.2.1-bin-hadoop2.7: https://spark.apache.org/downloads.html

  • Unpack the .tgz file using 7-Zip or another tool
  • Put it somewhere like C:\spark-3.2.1-bin-hadoop2.7

Note: we will use this path for the environment variables below.

  3. Download winutils.exe: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
  • Put it in C:\Hadoop\bin

  4. Download and install Python on Windows: https://www.python.org/downloads/

  5. Add environment variables:

In the Settings window, under Related Settings, click Advanced system settings. On the Advanced tab, click Environment Variables. Click New to create a new environment variable. Click Edit to modify an existing environment variable.

5.1. User variables:

  • JAVA_HOME : C:\Program Files\Java\jdk-1.8
  • PATH : %JAVA_HOME%\bin
  • HADOOP_HOME : C:\Hadoop
  • PYSPARK_DRIVER_PYTHON : jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS : notebook
  • PYSPARK_PYTHON : xxxxx\AppData\Local\Programs\Python\Python39\Scripts
  • SPARK_HOME : C:\spark-3.2.1-bin-hadoop2.7
  • SPARK_LOCAL_IP : localhost

5.2. System variables (add these entries to the Path variable; an in-notebook alternative is sketched after this list):

  • C:\Program Files\Java\jdk-20\bin
  • C:\spark-3.2.1-bin-hadoop2.7\bin
  • C:\Hadoop\bin
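
If editing the Windows dialogs is awkward, roughly the same variables can also be set from inside the notebook before Spark starts. This is only a sketch, not part of the original steps: the paths are examples that must match your own install locations, and it assumes the findspark package is installed (pip install findspark).

    import os

    # Example paths only -- adjust to wherever you installed the JDK, Hadoop/winutils, and Spark.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-1.8"
    os.environ["HADOOP_HOME"] = r"C:\Hadoop"
    os.environ["SPARK_HOME"] = r"C:\spark-3.2.1-bin-hadoop2.7"
    os.environ["SPARK_LOCAL_IP"] = "localhost"

    # Make winutils.exe and the JDK visible to the JVM that Spark launches.
    os.environ["PATH"] = os.pathsep.join([
        os.path.join(os.environ["JAVA_HOME"], "bin"),
        os.path.join(os.environ["HADOOP_HOME"], "bin"),
        os.environ["PATH"],
    ])

    import findspark
    findspark.init()  # reads the SPARK_HOME set above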

Testing:

  1. Open cmd and run java -version:

    C:\Users\xxxxxxx>java -version

It should return something like:

java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
  2. In cmd, run:

    C:\Users\xxxxxxx>pyspark

This command redirects you to http://localhost:8890/tree.

  3. Create a new notebook, enter the code below, and run it (a word-count follow-up is sketched after this block):

    import findspark
    findspark.init()

    import pyspark  # only run after findspark.init()
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.sql('''select 'spark' as India ''')
    df.show()
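
Once the session is up, the sc from the original question is simply spark.sparkContext, so the word-count line from the question should work as well. A minimal sketch, assuming a words.txt file exists in the notebook's working directory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext  # the `sc` the question expected to be predefined

    # Assumes words.txt exists in the notebook's working directory.
    wordcounts = sc.textFile('words.txt')
    print(wordcounts.count())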

After following the above steps, the code runs successfully (screenshot not reproduced here).

------------- Note ------------------

If everything is set up and you are still seeing an error like "Using Spark's default log4j ... ERROR SparkContext ...", try the steps below (a quick verification snippet follows the list):

  1. Close the cmd window, reopen it, and run the pyspark command again.

  2. Restart your system, reopen cmd, and run the pyspark command again.

  3. Check your Java version; the latest Java releases sometimes raise errors with PySpark. The combination of jdk-1.8 and spark-3.2.1-bin-hadoop2.7 is what worked for me.
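
As a quick sanity check, the sketch below prints which JDK and Spark installation the notebook session is actually picking up; it assumes only that the environment variables above are set and that pyspark is importable.

    import os
    import subprocess

    # Confirm which JDK and Spark install the notebook will use.
    print("JAVA_HOME  =", os.environ.get("JAVA_HOME"))
    print("SPARK_HOME =", os.environ.get("SPARK_HOME"))

    # `java -version` writes to stderr, so capture it explicitly.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip())

    import pyspark
    print("pyspark", pyspark.__version__)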

Vijay