
I have been trying to install Spark (pyspark) on my Windows 10 machine for two weeks now, and I realize I need your help.

When I try to start 'pyspark' in the command prompt, I still receive the following error:

The Problem

'pyspark' is not recognized as an internal or external command, operable program or batch file.

To me this hints at a problem with the PATH/environment variables, but I cannot find the root of the problem.

My Actions

I have tried multiple tutorials but the best I found was the one by Michael Galarnyk. I followed his tutorial step by step:

  • Installed Java
  • Installed Anaconda
  • Downloaded Spark 2.3.1 from the official website (I adjusted the commands accordingly, since Michael's tutorial uses a different version). Following the tutorial, I moved it in the command prompt:

    mv C:\Users\patri\Downloads\spark-2.3.1-bin-hadoop2.7.tgz C:\opt\spark\spark-2.3.1-bin-hadoop2.7.tgz
    

    Then I decompressed it:

    gzip -d spark-2.3.1-bin-hadoop2.7.tgz
    

    and

    tar xvf spark-2.3.1-bin-hadoop2.7.tar
    
  • Downloaded winutils.exe for Hadoop 2.7.1 from GitHub:

    curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe?raw=true
    
  • Set my environment variables accordingly:

    setx SPARK_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
    setx HADOOP_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
    setx PYSPARK_DRIVER_PYTHON jupyter
    setx PYSPARK_DRIVER_PYTHON_OPTS notebook
    

    Then I added C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin to my PATH variable. My user environment variables now look like this: (screenshot of the current environment variables)

These actions should have done the trick, but when I run pyspark --master local[2], I still get the error from above. Can you help me track down this error using the information above?

Checks

I ran a couple of checks in the command prompt to verify the following (a scripted version of these checks is sketched after the list):

  • Java is installed
  • Anaconda is installed
  • pip is installed
  • Python is installed
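
As an extra sanity check (a sketch of my own, not part of the original post), the same verifications can be scripted in Python from a fresh command prompt; the Spark path below is the one from the question and may differ on your machine:

    import os
    import shutil

    # Tools the "Checks" section verified manually; None means the tool is not on PATH.
    for tool in ("java", "python", "pip", "conda", "pyspark"):
        print(f"{tool:8s} -> {shutil.which(tool)}")

    # Variables set with setx are only visible in consoles opened afterwards.
    for var in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
        print(f"{var} = {os.environ.get(var)}")

    # Is the Spark bin directory actually on PATH?
    spark_bin = r"C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin"
    on_path = spark_bin.lower() in (p.lower() for p in os.environ["PATH"].split(os.pathsep))
    print("Spark bin on PATH:", on_path)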

2 Answers


I resolved this issue by setting the variables as "system variables" rather than "user variables". Note

  1. In my case, setting the variables from the command line resulted in "user variables", so I had to use the Advanced system settings GUI to enter the values as "system variables" (a small sketch for checking where a variable ended up follows this list)
  2. You may want to rule out any installation issue, in which case try to cd into C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin and run pyspark --master local[2] (make sure winutils.exe is there); if that does not work, then you have other issues than just environment variables
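
To check where a given variable actually ended up, a small Python sketch (my addition, not part of the original answer) can read both registry locations directly; SPARK_HOME and HADOOP_HOME below are just the names used in the question:

    import winreg

    def read_env(hive, subkey, name):
        """Read one environment variable from the given registry hive, or return None if absent."""
        try:
            with winreg.OpenKey(hive, subkey) as key:
                value, _ = winreg.QueryValueEx(key, name)
                return value
        except FileNotFoundError:
            return None

    # User variables live under HKEY_CURRENT_USER\Environment; system variables live under
    # HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment.
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        user_value = read_env(winreg.HKEY_CURRENT_USER, r"Environment", name)
        system_value = read_env(
            winreg.HKEY_LOCAL_MACHINE,
            r"SYSTEM\CurrentControlSet\Control\Session Manager\Environment",
            name,
        )
        print(f"{name}: user={user_value!r} system={system_value!r}")
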
  • Thanks a lot for your response! Indeed, changing from user to system variables fixed it; however, now I get the following error: "The system cannot find the path specified", even though when I manually check with "where python/conda/java etc." it tells me they are there. I googled this error; Stack Overflow suggests it comes from not [downloading the whole distribution](https://stackoverflow.com/questions/46849585/pyspark-the-system-cannot-find-the-path-specified?noredirect=1&lq=1), but I did so following the tutorial. Any ideas on how to fix this? – Patrick Glettig Jan 20 '19 at 13:00
  • Can you try findspark as described here: https://github.com/minrk/findspark? In the meantime can you accept this answer as it addresses the exact issue from the question? – mchl_k Feb 13 '19 at 09:13
  • Thanks for the package, this seems very promising. When I run `findspark.init()` as suggested in the `ReadMe`, an out-of-range error is returned: **IndexError: list index out of range**. `findspark.find()` works, so I tried `findspark.init(findspark.find())`, but no luck. – Patrick Glettig Feb 14 '19 at 13:35
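
For reference, here is a minimal sketch of the findspark approach discussed in the comments above (my addition, not from either post). It passes the Spark directory from the question to findspark.init() explicitly, which bypasses the auto-detection that raised the IndexError; adjust the path for your machine:

    import findspark

    # Point findspark at SPARK_HOME explicitly instead of relying on auto-detection.
    # The path below is the install location from the question.
    findspark.init(r"C:\opt\spark\spark-2.3.1-bin-hadoop2.7")

    import pyspark

    sc = pyspark.SparkContext(master="local[2]", appName="findspark-test")
    print(sc.parallelize(range(10)).sum())  # should print 45 if Spark starts correctly
    sc.stop()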

Following the steps explained in my blog post will resolve your problem:

How to Setup PySpark on Windows https://beasparky.blogspot.com/2020/05/how-to-setup-pyspark-in-windows.html

To set up the environment paths for Spark, go to "Advanced System Settings" and set the paths below:

    JAVA_HOME="C:\Program Files\Java\jdk1.8.0_181"
    HADOOP_HOME="C:\spark-2.4.0-bin-hadoop2.7"
    SPARK_HOME="C:\spark-2.4.0-bin-hadoop2.7"

Also, add their bin directories to the PATH system variable.
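
Once those variables are in place, a quick smoke test (my sketch, not part of the blog post) is to open a new prompt and run a minimal local session:

    from pyspark.sql import SparkSession

    # Start a local Spark session; this only works if SPARK_HOME, HADOOP_HOME (with winutils.exe)
    # and the PATH entries above are picked up correctly.
    spark = SparkSession.builder.master("local[2]").appName("setup-smoke-test").getOrCreate()
    print(spark.range(100).count())  # should print 100
    spark.stop()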
  • Thank you for contributing. Could you summarize the most important steps out of this blog post? Links tend to change or disappear in the future... – Michael Heil May 04 '20 at 06:48
  • This is good insight. Would you have the time to explain how these libraries interact? For example, if SPARK_HOME is embedded in Anaconda (C:\Users\mylogin\Anaconda3\lib\site-packages\pyspark) but also in a separate Spark directory, would there be a different path structure (as above, where your Spark is in its own folder)? – jamiel22 Mar 24 '23 at 15:24