I installed PySpark under Anaconda by issuing the following commands at a Conda prompt:
conda create -n py39 python=3.9 anaconda
conda activate py39
conda install openjdk
conda install pyspark
conda install -c conda-forge findspark
As can be seen, this is all within the py39
environment.
Additionally, I fetched Hadoop 2.7.1 from
GitHub and created
c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1
to contain the corresponding
README.md
file and bin
subfolder [1]. Here, %HOMEPATH%
is
\Users\User.Name
. Finally, I had to create file
%SPARK_HOME%/conf/spark-defaults.conf
(Annex A).
With the above setup, I could launch PySpark using the following
myspark.cmd
script located in
c:%HOMEPATH%\anaconda3\envs\py39\bin\
:
set "PYSPARK_DRIVER_PYTHON=python"
set "PYSPARK_PYTHON=python"
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
pyspark
I am now following this
page
to be able to use Spyder instead of the Conda command line. I am
using the following SpyderSpark.cmd
script to set the variables
and launch Spyder:
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=C:%HOMEPATH%\anaconda3\envs\py39\Library"
set "SPARK_HOME=C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
C:%HOMEPATH%\anaconda3\pythonw.exe ^
C:%HOMEPATH%\anaconda3\cwp.py ^
C:%HOMEPATH%\anaconda3\envs\py39 ^
C:%HOMEPATH%\anaconda3\envs\py39\pythonw.exe ^
C:%HOMEPATH%\anaconda3\envs\py39\Scripts\spyder-script.py
Some points that may not be clear:
Folder
%JAVA_HOME%\bin
contains java.exe
and javac.exe
The second half of the above code block is the command that is executed by Anaconda's shortcut for
Spyder (py39)
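For illustration, the two PYTHONPATH lines in the script prepend an entry to whatever the variable already holds. A minimal Python sketch of the same logic follows. The helper name prepend_path_entry and the sample entries are hypothetical, not part of my setup. Unlike the batch version, which leaves a trailing ";" when %PYTHONPATH% starts out undefined (visible in Annex B), this sketch drops empty entries:

```python
import os


def prepend_path_entry(new_entry, existing, sep=os.pathsep):
    """Return new_entry followed by the non-empty entries of existing.

    existing may be None (variable undefined) or a separator-joined string.
    """
    old_entries = [p for p in (existing or "").split(sep) if p]
    return sep.join([new_entry] + old_entries)


if __name__ == "__main__":
    # Hypothetical entries, only to show the order of the result.
    value = prepend_path_entry("first_dir", os.environ.get("PYTHONPATH"))
    print(value)
```

The most recently prepended entry comes first, which matches the order produced by the second set command in the script.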
As I am still trying to get SpyderSpark.cmd
to work, I execute it
from the Conda prompt, specifically the py39
environment. This way,
it inherits environment variables that I may have missed in
SpyderSpark.cmd
. Issuing SpyderSpark.cmd
launches the Spyder GUI,
but Spark commands aren't recognized at the console. Here is a
transcript of the response to the first few lines of code from
this
tutorial:
In [1]: columns = ["language","users_count"]
...: data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
In [2]: spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
NameError: name 'SparkSession' is not defined
The likely cause is that all but the PYTHONPATH
variable propagated
their values into the Spyder session. From the Spyder console:
import os
print(os.environ.get("HADOOP_HOME"))
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYTHONPATH"))
c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
C:\Users\User.Name\anaconda3\envs\py39\Library
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
Python
Python
None
Why isn't PYTHONPATH
propagating into the Spyder session, and how
can I fix this?
I don't think that this Q&A explains the problem because I am launching Spyder from a CMD environment after setting the variable. Furthermore, all the other variables succeed in propagating to the Spyder session.
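One thing worth ruling out is whether the PYTHONPATH entries still end up on sys.path even though the variable itself reads None. The sketch below is only a diagnostic aid, not an explanation of the behaviour. The function name report_pythonpath and its result keys are hypothetical. It compares the entries of a PYTHONPATH value against an import search path:

```python
import os
import sys


def _norm(path):
    """Normalize a path so that case and separator differences do not matter."""
    return os.path.normcase(os.path.normpath(path))


def report_pythonpath(environ=None, search_path=None, sep=os.pathsep):
    """Compare the PYTHONPATH entries with the import search path.

    Defaults to the current process (os.environ and sys.path).
    """
    environ = os.environ if environ is None else environ
    search_path = sys.path if search_path is None else search_path
    raw = environ.get("PYTHONPATH")
    entries = [p for p in (raw or "").split(sep) if p]
    known = {_norm(p) for p in search_path if p}
    return {
        "pythonpath": raw,
        "on_sys_path": [p for p in entries if _norm(p) in known],
        "missing": [p for p in entries if _norm(p) not in known],
    }


if __name__ == "__main__":
    print(report_pythonpath())
```

Running report_pythonpath() in the Spyder console and again in a plain python session started from the same CMD window would show whether the two sessions differ only in the variable or also in sys.path.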
Notes
[1] Using Cygwin, I found that for all the files in
c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1\bin
, the permission bits for
execution were disabled and needed to be explicitly enabled.
Afternote 2023-09-02:
Respondents posted helpful hints on how to get Spark commands
recognized in Spyder, i.e., to first issue from pyspark.sql import SparkSession
. I didn't see this tutorial code because it was in a screen capture and the image was blocked by AdBlocker. However, it worked.
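As background, a NameError means that the name was never bound in the session, which is different from the package being unreachable (that would raise ModuleNotFoundError at the import statement). A small sketch for telling the two apart, using only the standard library; the helper name is_importable is hypothetical:

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located on the import path.

    Only locates the module; it does not import or execute it.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pyspark" is the package from this post; the result depends on the session.
    print("pyspark importable:", is_importable("pyspark"))
```

If this prints True in the Spyder console, the package is reachable and only the import statement was missing.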
That doesn't answer the question of why 1 of 6
environment variables fails to propagate from SpyderSpark.cmd
to Spyder,
i.e., variable PYTHONPATH
. Admittedly, it solved the real
showstopper for me at present, for which I thank the respondents.
I would still be interested in why PYTHONPATH
doesn't propagate.
In case it helps anyone, I found it tricky to create a shortcut to
SpyderSpark.cmd
that doesn't leave a redundant terminal on the
desktop. The solution turned out to be to prefix the Spyder launching
command with start
:
set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
start "" ^
%USERPROFILE%\anaconda3\pythonw.exe ^
%USERPROFILE%\anaconda3\cwp.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\pythonw.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py
All the arguments starting with %USERPROFILE%
would ideally be
enclosed in double-quotes in case they expand to include
non-alphanumeric characters. For some reason, I couldn't do that
without incurring the incorrect behaviour in Annex B (below).
With SpyderSpark as revised above, the Target
field of the Windows shortcut
should contain:
%SystemRoot%\System32\cmd.exe /D /C "%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"
I found it handy to simply copy the Spyder shortcut and modify the
Target
field. For the sake of readability, here is the same command
broken into two physical lines (which isn't suitable for the Target
field of a shortcut):
%SystemRoot%\System32\cmd.exe /D /C ^
"%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"
Thanks to Mofi for advice that improved this afternote.
Further troubleshooting 2023-09-03
Following Mofi's advice, I revised SpyderSpark.cmd
to use the
console oriented python
rather than GUI-oriented pythonw
for
troubleshooting purposes:
set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
set PYTHONPATH & REM HHHHHHHHHHHHHHHHH
%USERPROFILE%\anaconda3\python.exe ^
%USERPROFILE%\anaconda3\cwp-debug.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\python.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py
Furthermore, SpyderSpark.cmd
was revised
to use a modified cwp.py
, dubbed cwp-debug.py
, wherein
PYTHONPATH
is printed out twice:
import os
import sys
import subprocess
from os.path import join, pathsep
from menuinst.knownfolders import FOLDERID, get_folder_path, PathNotFoundException
# call as: python cwp.py PREFIX ARGs...
prefix = sys.argv[1]
args = sys.argv[2:]
new_paths = pathsep.join([prefix,
                          join(prefix, "Library", "mingw-w64", "bin"),
                          join(prefix, "Library", "usr", "bin"),
                          join(prefix, "Library", "bin"),
                          join(prefix, "Scripts")])
print(os.environ["PYTHONPATH"]) ###################
env = os.environ.copy()
env['PATH'] = new_paths + pathsep + env['PATH']
env['CONDA_PREFIX'] = prefix
documents_folder, exception = get_folder_path(FOLDERID.Documents)
if exception:
    documents_folder, exception = get_folder_path(FOLDERID.PublicDocuments)
if not exception:
    os.chdir(documents_folder)
print(env["PYTHONPATH"]) ######################
sys.exit(subprocess.call(args, env=env))
When SpyderSpark.cmd
is executed from a CMD console, the proper
PYTHONPATH
is printed out by SpyderSpark.cmd
and at both
locations in cwp-debug.py
. Furthermore, PYTHONPATH
is echoed
to the screen when it is prepended to in SpyderSpark.cmd
.
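These printouts are consistent with how the wrapper passes its environment on: it copies os.environ, adjusts PATH and CONDA_PREFIX, and hands the copy to the child process. A stripped-down sketch of that hand-off, checking what the child actually sees, is below. The function name child_pythonpath and the value "demo_dir" are hypothetical, and the child here is a bare Python interpreter rather than Spyder:

```python
import os
import subprocess
import sys


def child_pythonpath(value):
    """Start a child Python with PYTHONPATH set and return what the child reports."""
    env = os.environ.copy()      # same starting point as the wrapper script
    env["PYTHONPATH"] = value    # the variable under investigation
    code = "import os; print(os.environ.get('PYTHONPATH'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_pythonpath("demo_dir"))
```

If a plain child interpreter reports the value intact, any loss would have to happen later, somewhere after the wrapper has started spyder-script.py.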
The next step was to check whether PYTHONPATH
was being clobbered
by spyder-script.py
, which is a very short script:
import re
import sys
from spyder.app.start import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
I'm actually trying to spin up on Python, so I'm wondering whether anyone can help decipher this code.
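As far as the script itself goes, the only manipulation is the re.sub call: it removes a trailing "-script.py", "-script.pyw", or ".exe" from sys.argv[0], so that the program name reads as plain "spyder". It then calls main() and exits with its return value. Nothing in these lines touches PYTHONPATH; what main() does afterwards is not visible here. The sketch below isolates the regular expression; the function name strip_launcher_suffix and the sample names are hypothetical:

```python
import re

# Same pattern as in spyder-script.py: an optional launcher suffix at the end.
LAUNCHER_SUFFIX = r'(-script\.pyw?|\.exe)?$'


def strip_launcher_suffix(argv0):
    """Remove a trailing '-script.py', '-script.pyw' or '.exe' from argv0."""
    return re.sub(LAUNCHER_SUFFIX, '', argv0)


if __name__ == "__main__":
    for name in ("spyder-script.py", "spyder-script.pyw", "spyder.exe", "spyder"):
        print(name, "->", strip_launcher_suffix(name))
```

Each of the four sample names comes out as "spyder"; a name without one of the suffixes is returned unchanged.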
Annex A: %SPARK_HOME%/conf/spark-defaults.conf
Here, %SPARK_HOME%
is
C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark
:
spark.eventLog.enabled true
spark.eventLog.dir C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.sql.autoBroadcastJoinThreshold -1
Annex B: Incorrect behaviour when start
arguments are double-quoted in SpyderSpark.cmd
When SpyderSpark.cmd
is executed, a terminal console appears with the following
messages:
C:\Users\User.Name\Documents\Python Scripts>set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name\Documents\Python Scripts>set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name\Documents\Python Scripts>set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>start "" "C:\Users\User.Name\anaconda3\pythonw.exe" ^
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\cwp.py" "C:\Users\User.Name\anaconda3\envs\py39" ^
[main 2023-09-02T23:29:02.117Z] update#setState idle
[main 2023-09-02T23:29:04.434Z] WSL is not installed, so could not detect WSL profiles
The VS Code app then appears, opened to a file cwp.py
(the 2nd
argument supplied to start
). When I exit VS Code, the following
additional messages are printed to the terminal console, followed by
the appearance of the Spyder app:
[main 2023-09-02T23:29:09.998Z] Extension host with pid 21404 exited with code: 0, signal: unknown.
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\envs\py39\pythonw.exe" "C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py"
When I exit Spyder, the terminal console then disappears.