I installed PySpark 3.2.0 on Windows 10 with Hadoop 3.3.1 following this link. Because of proxy issues, I had to download a winutils.exe built for a different Hadoop version (i.e., not the one corresponding to Hadoop 3.3.1).
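For reference, here is a quick sanity check of my environment variables and the winutils.exe location, run from Python (a minimal sketch; the commented paths are placeholders for my actual install directories):

import os

# Placeholder examples -- actual values depend on the install locations
print(os.environ.get("HADOOP_HOME"))  # e.g. C:\hadoop
print(os.environ.get("JAVA_HOME"))    # e.g. C:\Program Files\Java\jdk1.8.0_301
print(os.environ.get("SPARK_HOME"))   # e.g. C:\spark\spark-3.2.0-bin-hadoop3.2

# winutils.exe is expected under %HADOOP_HOME%\bin
winutils = os.path.join(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe")
print(winutils, "exists:", os.path.exists(winutils))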
When I open the command prompt and type pyspark, there don't seem to be any errors (only warnings):
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/24 16:12:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/
Using Python version 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021 20:19:38)
Spark context available as 'sc' (master = local[*], app id = local-163510634343847).
SparkSession available as 'spark'.
>>> 21/10/24 16:12:40 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
However, when I try to run another program that uses pyspark, I get the following errors:
py4j.protocol.Py4JJavaError: An error occurred while calling o378.parquet
Caused by: java.lang.UnsatisfiedLinkError
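I can't share the full program, but the failing operation boils down to a parquet write. A minimal sketch of the kind of call that fails (the app name, data, and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repro").getOrCreate()

# Sample data is a placeholder; any parquet write hits the same error
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("C:/tmp/out.parquet")  # raises Py4JJavaError here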
The outputs of java -version and pyspark --version are:
java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b25, mixed mode)
and
Using Scala version 2.12.15, Java HotSpot(TM) Client VM, 1.8.0_201
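In case it helps narrow this down, the JVM that the running Spark session actually uses can be queried from inside the pyspark shell (sc._jvm is an internal PySpark handle to the driver JVM, so this is a diagnostic sketch rather than a supported API):

# From inside the pyspark shell, where sc is the SparkContext
print(sc._jvm.java.lang.System.getProperty("java.version"))
print(sc._jvm.java.lang.System.getProperty("java.home"))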
Could this error be caused by the mismatch of the Java versions (1.8.0_201 vs 1.8.0_301)? Or is it most likely caused by having the wrong version of winutils.exe?