I have been trying to install Spark (pyspark) on my Windows 10 machine for two weeks now, and I have realized that I need your help.
The Problem
When I try to start 'pyspark' in the command prompt, I still receive the following error:
'pyspark' is not recognized as an internal or external command, operable program or batch file.
To me this hints at a problem with the PATH/environment variables, but I cannot find the root of the problem.
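In case it helps with the diagnosis, these are the checks I would run in a fresh command prompt to confirm that suspicion (both where and echo are standard Windows tools; where searches every folder on the PATH for a matching executable or batch file):

where pyspark
echo %PATH%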
My Actions
I have tried multiple tutorials, but the best one I found was by Michael Galarnyk. I followed his tutorial step by step:
- Installed Java
- Installed Anaconda
- Downloaded Spark 2.3.1 from the official website (I adjusted the commands accordingly, since Michael's tutorial uses a different version). In line with the tutorial, I moved it from the command prompt:
mv C:\Users\patri\Downloads\spark-2.3.1-bin-hadoop2.7.tgz C:\opt\spark\spark-2.3.1-bin-hadoop2.7.tgz
Then I untarred it:
gzip -d spark-2.3.1-bin-hadoop2.7.tgz
and
tar xvf spark-2.3.1-bin-hadoop2.7.tar
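To rule out a botched extraction, listing the resulting bin folder should show pyspark.cmd and spark-shell.cmd among the other launcher scripts (assuming the archive was unpacked inside C:\opt\spark, which is where the move command above put it):

dir C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin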
- Downloaded winutils.exe for Hadoop 2.7.1 from GitHub:
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe?raw=true
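The curl command saves winutils.exe into whatever directory the prompt happens to be in, and since HADOOP_HOME (set below) points at the Spark folder, the file should end up in that folder's bin directory. This is the check I would use to confirm it landed there:

dir C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin\winutils.exe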
- Set my environment variables accordingly:
setx SPARK_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON jupyter
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
Then I added C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin to my PATH variable. My environment user variables now look like this: [screenshot: Current Environmental Variables]
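One detail I am aware of: setx only writes the values for future sessions, it does not change the prompt it was typed into, so the following checks need a freshly opened command prompt. They should echo the paths above and show the Spark bin folder inside PATH:

echo %SPARK_HOME%
echo %HADOOP_HOME%
echo %PATH% | findstr /i spark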
These actions should have done the trick, but when I run pyspark --master local[2], I still get the error from above. Can you help me track down this error using the information above?
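One further check that might help to narrow it down: calling the Windows launcher by its full path bypasses the PATH lookup entirely, so if the command below works while the plain pyspark call does not, the problem is purely in the PATH entry (the path is the SPARK_HOME bin folder from above, and pyspark.cmd is the script the plain command would normally resolve to):

C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin\pyspark.cmd --master local[2]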
Checks
I ran a couple of checks in the command prompt to verify the following (the commands are shown below the bullet points):
- Java is installed
- Anaconda is installed
- pip is installed
- Python is installed
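These are roughly the commands behind those checks (exact flags can differ slightly between installs, but each one just prints a version string):

java -version
conda --version
pip --version
python --version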