
I have been trying to install Spark (pyspark) on my Windows 10 machine for two weeks now, and I realize I need your help.

When I try to start 'pyspark' in the command prompt, I still receive the following error:

The Problem

'pyspark' is not recognized as an internal or external command, operable program or batch file.

To me this hints at a problem with the PATH/environment variables, but I cannot find the root of the problem.

My Actions

I have tried multiple tutorials but the best I found was the one by Michael Galarnyk. I followed his tutorial step by step:

  • Installed Java
  • Installed Anaconda
  • Downloaded Spark 2.3.1 from the official website (I adjusted the commands accordingly, since Michael's tutorial uses a different version). Following the tutorial, I moved it in the command prompt:

    mv C:\Users\patri\Downloads\spark-2.3.1-bin-hadoop2.7.tgz C:\opt\spark\spark-2.3.1-bin-hadoop2.7.tgz
    

    Then I decompressed it:

    gzip -d spark-2.3.1-bin-hadoop2.7.tgz
    

    and

    tar xvf spark-2.3.1-bin-hadoop2.7.tar
    
  • Downloaded winutils.exe for Hadoop 2.7.1 from GitHub:

    curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe?raw=true
    
  • Set my environment variables accordingly:

    setx SPARK_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
    setx HADOOP_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
    setx PYSPARK_DRIVER_PYTHON jupyter
    setx PYSPARK_DRIVER_PYTHON_OPTS notebook
    

    Then I added C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin to my PATH variable. My user environment variables now look like this: (screenshot of the current environment variables)

These actions should have done the trick, but when I run pyspark --master local[2], I still get the error from above. Can you help me track down this error using the information above?

Checks

I ran a couple of checks in the command prompt to verify the following (a scripted version of these checks is sketched after the list):

  • Java is installed
  • Anaconda is installed
  • pip is installed
  • Python is installed
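
As an extra sanity check (a sketch of my own, not part of the original post), the same verifications can be scripted in Python from a fresh command prompt; the Spark path below is the one from the question and may differ on your machine:

    import os
    import shutil

    # Tools the "Checks" section verified manually; None means the tool is not on PATH.
    for tool in ("java", "python", "pip", "conda", "pyspark"):
        print(f"{tool:8s} -> {shutil.which(tool)}")

    # Variables set with setx are only visible in consoles opened afterwards.
    for var in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
        print(f"{var} = {os.environ.get(var)}")

    # Is the Spark bin directory actually on PATH?
    spark_bin = r"C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin"
    on_path = spark_bin.lower() in (p.lower() for p in os.environ["PATH"].split(os.pathsep))
    print("Spark bin on PATH:", on_path)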

2 Answers


I resolved this issue by setting the variables as "system variables" rather than "user variables". Note

  1. In my case, setting the variables from the command line resulted in "user variables", so I had to use the Advanced system settings GUI to enter the values as "system variables" (a small sketch for checking where a variable ended up follows this list)
  2. You may want to rule out any installation issue, in which case try to cd into C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin and run pyspark --master local[2] (make sure winutils.exe is there); if that does not work, then you have other issues than just environment variables
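
To check where a given variable actually ended up, a small Python sketch (my addition, not part of the original answer) can read both registry locations directly; SPARK_HOME and HADOOP_HOME below are just the names used in the question:

    import winreg

    def read_env(hive, subkey, name):
        """Read one environment variable from the given registry hive, or return None if absent."""
        try:
            with winreg.OpenKey(hive, subkey) as key:
                value, _ = winreg.QueryValueEx(key, name)
                return value
        except FileNotFoundError:
            return None

    # User variables live under HKEY_CURRENT_USER\Environment; system variables live under
    # HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment.
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        user_value = read_env(winreg.HKEY_CURRENT_USER, r"Environment", name)
        system_value = read_env(
            winreg.HKEY_LOCAL_MACHINE,
            r"SYSTEM\CurrentControlSet\Control\Session Manager\Environment",
            name,
        )
        print(f"{name}: user={user_value!r} system={system_value!r}")
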
  • Thanks a lot for your response! Indeed, changing from user to system variables fixed it; however, now I get the following error: "The system cannot find the path specified", even though when I manually check with "where python/conda/java etc." it tells me they are there. I googled this error; Stack Overflow suggests it comes from not [downloading the whole distribution](https://stackoverflow.com/questions/46849585/pyspark-the-system-cannot-find-the-path-specified?noredirect=1&lq=1), but I did so following the tutorial. Any ideas on how to fix this? – Patrick Glettig Jan 20 '19 at 13:00
  • Can you try findspark as described here: https://github.com/minrk/findspark? In the meantime can you accept this answer as it addresses the exact issue from the question? – mchl_k Feb 13 '19 at 09:13
  • Thanks for the package, this seems very promising. When I run `findspark.init()` as suggested in the `ReadMe`, an out-of-range error is returned: **IndexError: list index out of range**. `findspark.find()` works, so I tried `findspark.init(findspark.find())`, but no luck. – Patrick Glettig Feb 14 '19 at 13:35
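
For reference, here is a minimal sketch of the findspark approach discussed in the comments above (my addition, not from either post). It passes the Spark directory from the question to findspark.init() explicitly, which bypasses the auto-detection that raised the IndexError; adjust the path for your machine:

    import findspark

    # Point findspark at SPARK_HOME explicitly instead of relying on auto-detection.
    # The path below is the install location from the question.
    findspark.init(r"C:\opt\spark\spark-2.3.1-bin-hadoop2.7")

    import pyspark

    sc = pyspark.SparkContext(master="local[2]", appName="findspark-test")
    print(sc.parallelize(range(10)).sum())  # should print 45 if Spark starts correctly
    sc.stop()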

Following the steps explained in my blog post will resolve your problem:

How to Setup PySpark on Windows https://beasparky.blogspot.com/2020/05/how-to-setup-pyspark-in-windows.html

To set up the environment paths for Spark, go to "Advanced System Settings" and set the paths below:

    JAVA_HOME="C:\Program Files\Java\jdk1.8.0_181"
    HADOOP_HOME="C:\spark-2.4.0-bin-hadoop2.7"
    SPARK_HOME="C:\spark-2.4.0-bin-hadoop2.7"

Also, add their bin directories to the PATH system variable.
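
Once those variables are in place, a quick smoke test (my sketch, not part of the blog post) is to open a new prompt and run a minimal local session:

    from pyspark.sql import SparkSession

    # Start a local Spark session; this only works if SPARK_HOME, HADOOP_HOME (with winutils.exe)
    # and the PATH entries above are picked up correctly.
    spark = SparkSession.builder.master("local[2]").appName("setup-smoke-test").getOrCreate()
    print(spark.range(100).count())  # should print 100
    spark.stop()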
  • Thank you for contributing. Could you summarize the most important steps out of this blog post? Links tend to change or disappear in the future... – Michael Heil May 04 '20 at 06:48
  • This is good insight. Would you have the time to explain how these libraries interact? For example, if SPARK_HOME is embedded in Anaconda (C:\Users\mylogin\Anaconda3\lib\site-packages\pyspark) but also in a separate Spark directory, would there be a different path structure (as above, where your Spark is in its own folder)? – jamiel22 Mar 24 '23 at 15:24