
I have installed spark on my local machine (Windows 10) following this guide: https://changhsinlee.com/install-pyspark-windows-jupyter/.

When launching the notebook from Anaconda and running:

spark_session = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("Z_PA")\
        .getOrCreate()

It runs forever and never finishes (I have waited at least 60 minutes).

Prior to this I got the error "java-gateway-process-exited-before...". After reading this thread: https://stackoverflow.com/questions/31841509/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-po I installed the following versions and moved the installations to directories without spaces.

I downloaded and installed:

  • java version "1.8.0_202"
  • Anaconda: conda 4.11.0
  • Python: Python 3.8.5
  • Spark: spark-3.0.3-bin-hadoop2.7
    • winutils.exe (added to bin folder)

Spark is stored in C:\spark and Java in C:\Java. I have added both under "Environment variables: User variables for..."

  • SPARK_HOME=C:\spark\spark-3.0.3-bin-hadoop2.7
  • HADOOP_HOME=C:\spark\spark-3.0.3-bin-hadoop2.7
  • JAVA_HOME=C:\Java\jdk1.8.0_202
  • PYSPARK_DRIVER_PYTHON=jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS=notebook
  • and the corresponding \bin paths of Spark and Java added to my Path system variable (a quick way to double-check all of these from Python is sketched below).
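
A minimal sketch of such a check (it assumes only the variables listed above):

import os
from pathlib import Path

# check that the variables listed above are set and point at existing folders
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    value = os.environ.get(var)
    print(var, "=", value, "| exists:", value is not None and Path(value).exists())

# winutils.exe should be in %HADOOP_HOME%\bin and java.exe in %JAVA_HOME%\bin
print(Path(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe").exists())
print(Path(os.environ.get("JAVA_HOME", ""), "bin", "java.exe").exists())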

I have installed pyspark and findspark as well. These lines of code execute without any issues:

import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')
findspark.find()
import pyspark # only run after findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession

Does anyone know why it takes so long to get a SparkSession? Does anything in my installation look incorrect?


1 Answer


Here is my "recipe" to install and run pyspark on a Windows machine with Anaconda:

Prerequisites:

  • make sure anaconda is installed
  • it is recommended to work in a venv (virtual environment)
  • install pyspark

How to create a virtual environment for the project and install pyspark in it:

  1. open anaconda cmd prompt
  2. create the project directory and navigate to it: cd path/to/workspace && mkdir testproject && cd testproject
  3. create the virtual environment: python -m venv venv
  4. activate the virtual environment: .\venv\Scripts\activate
  5. install pyspark in the venv: pip install pyspark (a quick check is sketched after this list)
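
To confirm the install, a minimal check (assuming the venv from step 4 is still activated):

# run inside the activated venv to confirm pyspark is importable
import pyspark
print(pyspark.__version__)

pip installs the latest pyspark by default; if you want it to match the Spark 3.2.0 download used in the next section, you can pin it with pip install pyspark==3.2.0.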

Preparation for pyspark (Spark, Hadoop) on Windows

  1. Create a folder for Spark and Hadoop, e.g. in the following path: C:/Users/YourName/spark_setup/
  2. Download spark-3.2.0-bin-hadoop2.7.tgz from https://archive.apache.org/dist/spark/spark-3.2.0/ into your spark_setup folder
  3. Extract it into the spark_setup folder using the following command: tar -xvzf spark-3.2.0-bin-hadoop2.7.tgz
  4. Create a /hadoop/bin folder within your spark_setup folder, e.g. like this: C:/Users/YourName/spark_setup/spark-3.2.0-bin-hadoop2.7/hadoop/bin/
  5. Download https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe and place it in C:/Users/YourName/spark_setup/spark-3.2.0-bin-hadoop2.7/hadoop/bin/ (a quick path check is sketched after this list)
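
A quick path check (just a sketch, using the example path from the steps above; adjust it to your own location) to verify that winutils.exe ended up where Spark expects it:

from pathlib import Path

# example path from steps 4-5 above; replace YourName with your own user folder
hadoop_bin = Path("C:/Users/YourName/spark_setup/spark-3.2.0-bin-hadoop2.7/hadoop/bin")
print((hadoop_bin / "winutils.exe").exists())  # should print True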

Run this Python file:

from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row
from pyspark.sql import SparkSession
import os

if __name__ == '__main__':
    # point Spark at the venv's Python interpreter (placeholder path, adjust to your venv)
    os.environ['PYSPARK_PYTHON'] = "path/to/workspace/testproject/venv/Scripts/python.exe"
    # point Spark at the hadoop folder containing bin/winutils.exe (from the setup above)
    os.environ['HADOOP_HOME'] = "C:/Users/YourName/spark_setup/spark-3.2.0-bin-hadoop2.7/hadoop"
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([
        Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
        Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
        Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
    ])
    df.show(2)
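
If the setup is correct, the session should start within seconds and df.show(2) prints the first two rows of the DataFrame, which is a quick confirmation that the Java gateway starts properly instead of hanging.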