I have installed Spark on my local machine (Windows 10) following this guide: https://changhsinlee.com/install-pyspark-windows-jupyter/.
When launching the notebook from Anaconda and running:
spark_session = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Z_PA") \
    .getOrCreate()
It hangs and never finishes (I have waited at least 60 minutes).
Before this I got the error "java-gateway-process-exited-before...". After reading this thread: https://stackoverflow.com/questions/31841509/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-po I installed the following versions and moved everything to directories without spaces.
I downloaded and installed:
- java version "1.8.0_202"
- Anaconda: conda 4.11.0
- Python: Python 3.8.5
- Spark: spark-3.0.3-bin-hadoop2.7
- winutils.exe (added to the bin folder)
Spark is stored in C:\spark and Java in C:\Java. I have added both to my "Environment variables: User variables for...":
- SPARK_HOME=C:\spark\spark-3.0.3-bin-hadoop2.7
- HADOOP_HOME=C:\spark\spark-3.0.3-bin-hadoop2.7
- JAVA_HOME=C:\Java\jdk1.8.0_202
- PYSPARK_DRIVER_PYTHON=jupyter
- PYSPARK_DRIVER_PYTHON_OPTS=notebook
- and the corresponding \bin file paths of Spark and Java to my Path system variable.
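To confirm the settings above are actually visible to Python, I can run a quick stdlib-only check from the notebook (only the variable names above are assumed; the space check reflects the advice from the linked thread about paths without spaces):

```python
import os

# Print what Python actually sees for each variable
# (None means the variable is not set in this session).
required = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME",
            "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")
settings = {name: os.environ.get(name) for name in required}
for name, value in settings.items():
    print(f"{name} = {value}")

# The java-gateway error is often blamed on spaces in these paths,
# so also flag any set path that contains a space.
with_spaces = [name for name, value in settings.items()
               if value and " " in value]
print("Paths containing spaces:", with_spaces)
```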
I have installed pyspark and findspark as well. These lines of code execute without any issues:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')  # raw string so the backslashes are not treated as escapes
findspark.find()
import pyspark # only run after findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
Does anyone know why it takes so long to get a SparkSession? Does anything in my installation look incorrect?