
I have been struggling with this issue for four days. I have looked at several web pages dealing with the same problem, including here on Stack Overflow, but without finding a solution.

I installed Spark 2.3.0, Scala 2.12.5, and Hadoop 2.7.1 (for winutils master), then set up the corresponding environment variables. I installed findspark and then launch pyspark from my Jupyter Notebook. The issue is that when I run:

sc = pyspark.SparkContext('local')

I get the following error:

java gateway process exited before sending the driver its port number

I should mention that I'm using Java 1.8.0 and that I have set the following in my environment variables:

 PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
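
For reference, the full notebook cell looks roughly like this (findspark.init() picks up SPARK_HOME from my environment variables):

import findspark
findspark.init()  # locate the Spark installation via SPARK_HOME

import pyspark
sc = pyspark.SparkContext('local')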

If you have any idea how I can solve this issue, I would be grateful. Thank you!

Iriel
  • Are you trying to create a new spark context with this line `sc = pyspark.SparkContext('local')`? And is this all just for running spark in jupyter? – ernest_k Apr 03 '18 at 08:43
  • Yes, I'm trying to create a Spark context in order to develop with PySpark. – Iriel Apr 03 '18 at 08:46

1 Answer


The setup is fairly simple and straightforward. Below are steps that you can follow.

Assumed:

  • You have downloaded Spark and extracted its archive into <spark_home>, and added the <spark_home>/bin directory to your PATH variable
  • You have installed Jupyter and it can be launched with jupyter notebook from the command line

Steps to be followed:

Export these two variables. This is best done in your user profile script:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

To open Jupyter, all you have to do is call:

pyspark

If you have additional options, such as master, you can pass them to pyspark:

pyspark --master local[2]

When the notebook opens, the Spark context is already initialized (as sc), and so is the Spark session (as spark), and you should see something like this:

[Screenshot: notebook showing the initialized SparkContext and SparkSession]
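
As a quick sanity check, you could run something like the following in the first cell (sc and spark are the names pyspark injects for you; the exact version string depends on your install):

print(sc.version)        # prints the Spark version, e.g. 2.3.0
spark.range(5).count()   # runs a trivial job through the SparkSession and returns 5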

ernest_k
  • Thank you very much for your answer. In fact, I get the error when I connect to my Jupyter notebook directly through Anaconda Navigator; in other words, I didn't open the Jupyter notebook by calling pyspark. When I open the Jupyter notebook by calling pyspark, I no longer get the error and I get the same result as yours. – Iriel Apr 03 '18 at 09:19
  • In this case I am a little confused, because launching the Jupyter notebook in these two different ways gives two notebooks that are completely different in terms of content. Can you please clarify in which of them I can develop with PySpark? – Iriel Apr 03 '18 at 09:22
  • I prefer this way because it minimizes setup efforts. PySpark takes care of exporting paths for you. In fact, I even managed to use pyspark in `pydev` using this method. It isn't intrusive and both your jupyter and spark installations remain clean (to run the raw pyspark shell, all you need to do is remove these 2 environment variables...) – ernest_k Apr 03 '18 at 09:26
  • Thank you very much for your clarification! – Iriel Apr 03 '18 at 09:31