
I have a machine with Hadoop and Spark installed. Below is my current environment:

Python 3.6

Spark 1.5.2

Hadoop 2.7.1.2.3.6.0-3796

I was trying to connect a Jupyter notebook to Spark by setting up an IPython kernel.

The following files were written:

  1. /root/.ipython/profile_pyspark/ipython_notebook_config.py

  2. /root/.ipython/profile_pyspark/startup/00-pyspark-setup.py

  3. /root/anaconda3/share/jupyter/kernels/pyspark/kernel.json

kernel.json

{
    "display_name": "PySpark (Spark 2.0.0)",
    "language": "python",
    "argv": [
        "/root/anaconda3/bin/python3",
        "-m",
        "ipykernel",
        "--profile=pyspark"
    ],
    "env": {
        "CAPTURE_STANDARD_OUT": "true",
        "CAPTURE_STANDARD_ERR": "true",
        "SEND_EMPTY_OUTPUT": "false",
        "PYSPARK_PYTHON" : "/root/anaconda3/bin/python3",
        "SPARK_HOME": "/usr/hdp/current/spark-client/"
    }
}

00-pyspark-setup.py

import os
import sys

# Point PySpark at the Anaconda interpreter and the HDP Spark client install.
os.environ["PYSPARK_PYTHON"] = "/root/anaconda3/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
spark_home = os.environ.get('SPARK_HOME', None)

# Make the bundled Py4J and PySpark packages importable.
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.8.2.1-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

# Run Spark's interactive-shell bootstrap, which is expected to create sc.
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())

ipython_notebook_config.py

c = get_config()
c.NotebookApp.port = 80

Then, when I run the following:

jupyter notebook --profile=pyspark

The notebook runs fine. I then change the kernel to 'PySpark (Spark 2.0.0)', which is supposed to provide the 'sc' Spark context. However, when I type 'sc', nothing shows up.

So, since sc was never initialised, running the following fails:

nums = sc.parallelize(xrange(1000000))

Can anybody help me configure a Jupyter notebook to talk to Spark?

  • There seems to be a lot going on here. Try to focus your problem/question more. I suggest moving your Spark 2.0 problems to another question. – Jon Apr 28 '17 at 16:30

3 Answers


Just FYI, Python 3.6 isn't supported until Spark 2.1.1. See JIRA https://issues.apache.org/jira/browse/SPARK-19019

– Pushkr

There are a number of issues with your question...

1) On top of the answer by Pushkr above - Spark 1.5 only works with Python 2; Python 3 support was introduced in Spark 2.0.

2) Even if you switch to Python 2 or upgrade Spark, you will still need to import the relevant PySpark modules and initialise the sc variable manually in the notebook (see the sketch below).

3) You also seem to be using an old version of Jupyter, since the profiles functionality is not available in Jupyter >= 4.

To initialise sc "automatically" in Jupyter >=4, see my answer here.
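For example, here is a minimal sketch of what such a first notebook cell could look like, assuming SPARK_HOME points at the Spark 1.5 client install from the question (the app name and local master are illustrative, not part of the original setup):

import os
import sys

# Make the bundled PySpark and Py4J packages importable
# (py4j-0.8.2.1 is the version shipped with Spark 1.5.x).
spark_home = os.environ.get("SPARK_HOME", "/usr/hdp/current/spark-client")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

from pyspark import SparkConf, SparkContext

# Create the context yourself instead of relying on the kernel to do it.
conf = SparkConf().setAppName("jupyter-pyspark").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(1000)).sum())  # quick sanity check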

– desertnaut

You can make a few environment changes to have pyspark launch into IPython or a Jupyter notebook by default.

Put the following in your ~/.bashrc

export PYSPARK_PYTHON=python3 ## for python3
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"

See: pyspark on GitHub

Next, run source ~/.bashrc

Then, when you launch pyspark (locally or on YARN), it will open up a notebook server for you to connect to.

On a local terminal that has ssh capabilities, run

ssh -N -f -L localhost:8000:localhost:7000 <username>@<host>

If you're on Windows, I recommend MobaXterm or Cygwin.

Open up a web browser and go to the address localhost:8000 to reach your Spark-enabled notebook through the tunnel.
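As a quick check that the notebook really is wired to Spark (a sketch that assumes the pyspark launcher has already created sc in the session):

# Run in the first notebook cell; sc is created by the pyspark launcher.
print(sc.version)                               # e.g. '1.5.2'
print(sc.parallelize(range(1000000)).count())   # should print 1000000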

A few caveats: I've never tried this with Python 3, so it may or may not work for you. Regardless, you should really be using Python 2 on Spark 1.5. My company uses Spark 1.5 as well, and no one uses Python 3 because of it.

Update:

Per @desertnaut's comments, setting

export PYSPARK_DRIVER_PYTHON=ipython

may cause issues if the user ever needs to use spark-submit. A workaround, if you want both notebooks and spark-submit available, is to define two new commands in your shell. Here is an example of what you might create:

export PYSPARK_PYTHON=python3  ## for python3
alias ipyspark='PYSPARK_DRIVER_PYTHON=ipython pyspark'
alias pynb='PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" pyspark'

where ipyspark and pynb are new commands on a bash terminal.

– Jon
  • Setting `PYSPARK_DRIVER_PYTHON` to `ipython` or `jupyter` is a really *bad* practice, which can create serious problems downstream (e.g. [when trying `spark-submit`](https://stackoverflow.com/questions/46772280/spark-submit-cant-locate-local-file/46773025#46773025)); the recommended way is to [create an appropriate Jupyter kernel](https://stackoverflow.com/questions/47824131/configuring-spark-to-work-with-jupyter-notebook-and-anaconda). – desertnaut Dec 18 '17 at 15:16
  • Yes, that's a common problem that can arise if you need to use `spark-submit`. At my previous job, we used it interactively so we seldom used `spark-submit`. However, a solution around this issue is to create a new variable, `ipyspark = PYSPARK_DRIVER_PYTHON=ipython pyspark`. I'll explain this as an update to the answer. – Jon Dec 18 '17 at 17:03