
I installed PySpark under Anaconda by issuing the following commands at a Conda prompt:

conda create -n py39 python=3.9 anaconda
conda activate py39
conda install openjdk
conda install pyspark
conda install -c conda-forge findspark

As can be seen, this is all within the py39 environment. Additionally, I fetched Hadoop 2.7.1 from GitHub and created c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1 to contain the corresponding README.md file and bin subfolder [1]. Here, %HOMEPATH% is \Users\User.Name. Finally, I had to create file %SPARK_HOME%/conf/spark-defaults.conf (Annex A).
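To sanity-check that manually created layout before launching anything, a short Python snippet can be run from the py39 environment. This is only a sketch: winutils.exe is my assumption about what the bin folder fetched from GitHub contains, so adjust the names if your copy differs.

import os

# Sketch: verify the manually created Hadoop layout described above.
# winutils.exe is an assumption; the GitHub copy may be organized differently.
hadoop_home = os.path.expandvars(r"%USERPROFILE%\AppData\Local\Hadoop\2.7.1")
for rel in ("README.md", "bin", os.path.join("bin", "winutils.exe")):
    path = os.path.join(hadoop_home, rel)
    print(path, "->", "present" if os.path.exists(path) else "MISSING")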

With the above setup, I could launch PySpark using the following myspark.cmd script located in c:%HOMEPATH%\anaconda3\envs\py39\bin\:

set "PYSPARK_DRIVER_PYTHON=python"
set "PYSPARK_PYTHON=python"
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
pyspark

I am now following this page to be able to use Spyder instead of the Conda command line. I am using the following SpyderSpark.cmd script to set the variables and launch Spyder:

set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=C:%HOMEPATH%\anaconda3\envs\py39\Library"
set "SPARK_HOME=C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

C:%HOMEPATH%\anaconda3\pythonw.exe ^
C:%HOMEPATH%\anaconda3\cwp.py ^
C:%HOMEPATH%\anaconda3\envs\py39 ^
C:%HOMEPATH%\anaconda3\envs\py39\pythonw.exe ^
C:%HOMEPATH%\anaconda3\envs\py39\Scripts\spyder-script.py

Some points that may not be clear:

  • Folder %JAVA_HOME%\bin contains java.exe and javac.exe

  • The second half of the above code block is the command that is executed by Anaconda's shortcut for Spyder (py39)

As I am still trying to get SpyderSpark.cmd to work, I execute it from the Conda prompt, specifically in the py39 environment. This way, it inherits environment variables that I may have missed in SpyderSpark.cmd. Issuing SpyderSpark.cmd launches the Spyder GUI, but Spark commands aren't recognized at the console. Here is a transcript of the response to the first few lines of code from this tutorial:

In [1]: columns = ["language","users_count"]
   ...: data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
In [2]: spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
NameError: name 'SparkSession' is not defined

The likely cause is that PYTHONPATH, alone among these variables, did not propagate into the Spyder session. Checking from the Spyder console:

import os
print(os.environ.get("HADOOP_HOME"))
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYTHONPATH"))

   c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
   C:\Users\User.Name\anaconda3\envs\py39\Library
   C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
   Python
   Python
   None
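A complementary check from the same console (again only a diagnostic sketch; the exact entries depend on the installation) is to look at sys.path, since entries that do arrive via PYTHONPATH would normally show up there:

import sys

# Any py4j/pyspark entries contributed by PYTHONPATH would appear here.
# Note that the conda-installed pyspark remains importable through
# site-packages even if nothing matches this filter.
print([p for p in sys.path if "py4j" in p.lower() or "pyspark" in p.lower()])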

Why isn't PYTHONPATH propagating into the Spyder session, and how can I fix this?

I don't think that this Q&A explains the problem because I am launching Spyder from a CMD environment after setting the variable. Furthermore, all the other variables succeed in propagating to the Spyder session.
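As a possible interim workaround (untested here), the findspark package installed above is designed to locate a Spark installation at runtime and patch sys.path itself, which would sidestep PYTHONPATH entirely. A minimal sketch, assuming SPARK_HOME is visible to the Spyder session as shown above:

import findspark

# findspark reads SPARK_HOME (or can be given the path explicitly, e.g.
# findspark.init(r"C:\...\envs\py39\lib\site-packages\pyspark")) and adds
# the pyspark and py4j libraries to sys.path for the current interpreter.
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()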

Notes

[1] Using Cygwin, I found that for all the files in c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1\bin, the permission bits for execution were disabled and needed to be explicitly enabled.

Afternote 2023-09-02:

Respondents posted helpful hints on how to get Spark commands recognized in Spyder, namely, to first issue from pyspark.sql import SparkSession. I hadn't seen this line of the tutorial's code because it appears in a screen capture that was blocked by AdBlocker. Once issued, however, it worked.

That doesn't answer the question of why 1 of the 6 environment variables, namely PYTHONPATH, fails to propagate from SpyderSpark.cmd to Spyder. Admittedly, the hint solved the real showstopper for me at present, for which I thank the respondents. I would still be interested in knowing why PYTHONPATH doesn't propagate.

In case it helps anyone, I found it tricky to create a shortcut to SpyderSpark.cmd that doesn't leave a redundant terminal on the desktop. The solution turned out to be to prefix the Spyder launching command with start:

set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

start "" ^
%USERPROFILE%\anaconda3\pythonw.exe ^
%USERPROFILE%\anaconda3\cwp.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\pythonw.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py

All the arguments starting with %USERPROFILE% would ideally be enclosed in double quotes in case they expand to include non-alphanumeric characters. For some reason, I couldn't do that without incurring the incorrect behaviour described in Annex B (below).

With SpyderSpark as revised above, the Target field of the Windows shortcut should contain:

%SystemRoot%\System32\cmd.exe /D /C "%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"

I found it handy to simply copy the Spyder shortcut and modify the Target field. For the sake of readability, here is the same command broken into two physical lines (which isn't suitable for the Target field of a shortcut):

%SystemRoot%\System32\cmd.exe /D /C ^
"%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"

Thanks to Mofi for the advice that improved this afternote.

Further troubleshooting 2023-09-03

Following Mofi's advice, I revised SpyderSpark.cmd to use the console-oriented python.exe rather than the GUI-oriented pythonw.exe for troubleshooting purposes:

set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

set PYTHONPATH & REM HHHHHHHHHHHHHHHHH
%USERPROFILE%\anaconda3\python.exe ^
%USERPROFILE%\anaconda3\cwp-debug.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\python.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py

Furthermore, SpyderSpark.cmd was revised to use a modified cwp.py, dubbed cwp-debug.py, wherein PYTHONPATH is printed out twice:

import os
import sys
import subprocess
from os.path import join, pathsep

from menuinst.knownfolders import FOLDERID, get_folder_path, PathNotFoundException

# call as: python cwp.py PREFIX ARGs...

prefix = sys.argv[1]
args = sys.argv[2:]

new_paths = pathsep.join([prefix,
                         join(prefix, "Library", "mingw-w64", "bin"),
                         join(prefix, "Library", "usr", "bin"),
                         join(prefix, "Library", "bin"),
                         join(prefix, "Scripts")])
print(os.environ["PYTHONPATH"]) ###################
env = os.environ.copy()
env['PATH'] = new_paths + pathsep + env['PATH']
env['CONDA_PREFIX'] = prefix

documents_folder, exception = get_folder_path(FOLDERID.Documents)
if exception:
    documents_folder, exception = get_folder_path(FOLDERID.PublicDocuments)
if not exception:
    os.chdir(documents_folder)
print(env["PYTHONPATH"]) ######################
sys.exit(subprocess.call(args, env=env))

When SpyderSpark.cmd is executed from a CMD console, the proper PYTHONPATH is printed out by SpyderSpark.cmd and at both locations in cwp-debug.py. Furthermore, PYTHONPATH is echoed to the screen as it is prepended to in SpyderSpark.cmd.

The next step was to check whether PYTHONPATH was being clobbered by spyder-script.py, which is a very short script:

import re
import sys

from spyder.app.start import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

I'm actually trying to spin up on Python, so I'm wondering whether anyone can help decipher this code.
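One further instrumentation step I am considering (a sketch only, requiring a temporary edit to spyder-script.py, and visible only when using the console-oriented python.exe launcher above) is to print PYTHONPATH and the front of sys.path immediately before Spyder's main() is invoked, to see whether the variable is already gone by that point:

import os
import re
import sys

from spyder.app.start import main

if __name__ == '__main__':
    # Temporary diagnostics: what did this interpreter actually inherit
    # from cwp-debug.py?
    print("PYTHONPATH =", os.environ.get("PYTHONPATH"))
    print("sys.path (first entries) =", sys.path[:5])
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())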

Annex A: %SPARK_HOME%/conf/spark-defaults.conf

Here, %SPARK_HOME% is C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark:

spark.eventLog.enabled true
spark.eventLog.dir C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.sql.autoBroadcastJoinThreshold -1

Annex B: Incorrect behaviour when start arguments are double-quoted in SpyderSpark.cmd

When SpyderSpark.cmd is run, a terminal console appears with the following messages:

C:\Users\User.Name\Documents\Python Scripts>set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name\Documents\Python Scripts>set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name\Documents\Python Scripts>set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>start "" "C:\Users\User.Name\anaconda3\pythonw.exe" ^
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\cwp.py" "C:\Users\User.Name\anaconda3\envs\py39" ^
[main 2023-09-02T23:29:02.117Z] update#setState idle
[main 2023-09-02T23:29:04.434Z] WSL is not installed, so could not detect WSL profiles

The VS Code app then appears, opened to a file cwp.py (the 2nd argument supplied to start). When I exit VS Code, the following additional messages are printed to the terminal console, followed by the appearance of the Spyder app:

[main 2023-09-02T23:29:09.998Z] Extension host with pid 21404 exited with code: 0, signal: unknown.
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\envs\py39\pythonw.exe" "C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py"

When I exit Spyder, the terminal console then disappears.


1 Answer


There might be a few issues that need to be addressed:

1 - Environment Variable Formatting: In your code, you're using %HOMEPATH% without surrounding it with % signs. It should be %HOMEPATH% to properly reference the home directory. For example, instead of c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1, it should be c:\%HOMEPATH%\AppData\Local\Hadoop\2.7.1.

2 - Setting Environment Variables: It's better to set the environment variables before launching PySpark to ensure that they are correctly configured for the PySpark session. In your example, you're setting the variables after running pyspark.

3 - Importing SparkSession: The error you encountered (NameError: name 'SparkSession' is not defined) indicates that you haven't imported the necessary modules for Spark. You need to import SparkSession from the pyspark.sql module at the beginning of your script.

# Create and activate the Conda environment
conda create -n py39 python=3.9 anaconda
conda activate py39

# Install necessary packages
conda install openjdk
conda install pyspark

# Set environment variables (adjust paths accordingly)
set "HADOOP_HOME=C:\%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=C:\%HOMEPATH%\anaconda3\envs\py39\Library"
set "SPARK_HOME=C:\%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=python"
set "PYSPARK_PYTHON=python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

# Launch PySpark
pyspark
# Inside the PySpark session
from pyspark.sql import SparkSession

columns = ["language", "users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

# Create a SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Your Spark code here
# For example, you can create a DataFrame from the data
df = spark.createDataFrame(data, columns)
df.show()

# Don't forget to stop the SparkSession when done
spark.stop()

After correcting the issues and following these steps, you should be able to properly launch a PySpark session within your Conda environment and interact with Spark components.

  • The prose around item #1 makes no sense. You say they left off `%` signs when they did not, then silently add a `\ ` (that may or may not be correct) instead of changing anything related to `%` symbols. – ShadowRanger Sep 01 '23 at 01:10
  • @residentcode: Thank you for your explanation. I'll respond to your enumerated points: (1) This may be an error. I enclose all references to environment variables with `%`. (2) Script `myspark.cmd` does set environment variables before `pyspark`. The problem is that `pyspark` isn't invoked in `SpyderSpark.cmd`. (3) You are right. My browser did not show the screen shot in the tutorial. I only saw the caption "PySpark application running on Spyder IDE"... – user2153235 Sep 01 '23 at 06:49
  • ...so I started entering code from [here](https://sparkbyexamples.com/pyspark/different-ways-to-create-dataframe-in-pyspark). I could see the screen shot after switching browsers, and the `import` command did indeed make Spark commands recognizable. However, that is just the context for the question, albeit very important. The question is why 1 of 6 variables do not propagate into Spyder. Would you be able to comment on that? – user2153235 Sep 01 '23 at 06:49
  • The directory path assigned to the environment variable `HOMEPATH` begins always with a backslash. There should not be used `c:\%HOMEPATH%` as that results finally after environment variable expansion in `c:\\Users\User.Name` which the Windows file IO functions must later automatically correct most likely multiple times to `c:\Users\User.Name`. It would be better to use `%HOMEDRIVE%%HOMEPATH%` instead of `C:\%HOMEPATH%` or much better `%USERPROFILE%` and `%LOCALAPPDATA%` instead of `C:\%HOMEPATH%\AppData\Local` in the batch script. – Mofi Sep 02 '23 at 16:45