
I am new to PySpark. I installed Anaconda ("bash Anaconda2-4.0.0-Linux-x86_64.sh") on Ubuntu and also installed PySpark. Everything works fine in the terminal, but I want to use it in Jupyter. I created an IPython profile in my Ubuntu terminal as follows:

wanderer@wanderer-VirtualBox:~$ ipython profile create pyspark
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py'

wanderer@wanderer-VirtualBox:~$ export ANACONDA_ROOT=~/anaconda2
wanderer@wanderer-VirtualBox:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
wanderer@wanderer-VirtualBox:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python

wanderer@wanderer-VirtualBox:~$ cd spark-1.5.2-bin-hadoop2.6/
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS="notebook" ./bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/24 15:27:42 INFO SparkContext: Running Spark version 1.5.2
16/04/24 15:27:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/04/24 15:27:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:33514 with 530.3 MB RAM, BlockManagerId(driver, localhost, 33514)
16/04/24 15:27:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7fc96cc6fd10>

In [2]: print sc.version
1.5.2

In [3]: 

Below are the versions of Jupyter and IPython:

wanderer@wanderer-VirtualBox:~$ jupyter --version
4.1.0

wanderer@wanderer-VirtualBox:~$ ipython --version
4.1.2

I tried to integrate the Jupyter notebook with PySpark, but everything failed. I want to work in Jupyter and have no idea how to integrate the two.

Can anyone show me how to integrate the above components?

– Wanderer
    Check this [Link jupyter and pyspark](http://stackoverflow.com/questions/33064031/link-spark-with-ipython-notebook/33065359#33065359) – Alberto Bonsanto Apr 24 '16 at 12:35
  • @AlbertoBonsanto Excellent... finally the issue is solved and I have started practicing with PySpark. The given link cleared my obstacle! – Wanderer Apr 24 '16 at 17:50

3 Answers


Just run the command:

PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
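To make this the default, a minimal sketch (assuming bash, and assuming Spark lives at `~/spark-1.5.2-bin-hadoop2.6` as in the question; adjust the path for your install) is to export the variables once in `~/.bashrc`:

export SPARK_HOME=~/spark-1.5.2-bin-hadoop2.6   # assumed install path; change to yours
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

After `source ~/.bashrc`, running plain `pyspark` starts a Jupyter notebook with `sc` already defined.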
– MyounghoonKim

Add these two lines to the `pyspark` launcher script (or export them in your shell profile) using nano or vim:

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
– volonte volonte

EDIT 2017-Oct

With Spark 2.2 and findspark this works well; there is no need for those env vars:

import findspark
findspark.init('/opt/spark')  # point findspark at your Spark installation path
import pyspark
sc = pyspark.SparkContext()   # works without setting any environment variables
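To confirm the context is wired up, a quick sanity check in the next notebook cell (a hypothetical example, not from the original answer) might look like:

rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())  # sum of squares 0..9 -> prints 285
print(sc.version)                      # e.g. 2.2.0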

OLD

The fastest way I found was to run:

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark

Or the equivalent for Jupyter. This should open an IPython notebook with PySpark enabled. You might also want to look at Beaker notebook.

– citynorman
  • Easier still, run from the command line: `IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark`. Found [here](http://npatta01.github.io/2015/08/01/pyspark_jupyter/) – citynorman Jul 21 '16 at 01:48
  • `IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark` appears to be removed in Spark 2.0+ – Neal May 05 '17 at 18:55