
When using PySpark I'd like a SparkContext to be initialised (in yarn client mode) upon creation of a new notebook.

The following tutorials describe how to do this in past versions of IPython/Jupyter (< 4):

https://www.dataquest.io/blog/pyspark-installation-guide/

https://npatta01.github.io/2015/07/22/setting_up_pyspark/

I'm not quite sure how to achieve the same with notebook > 4, as noted in http://jupyter.readthedocs.io/en/latest/migrating.html#since-jupyter-does-not-have-profiles-how-do-i-customize-it

I can manually create and configure a SparkContext, but I don't want our analysts to have to worry about this.

Does anyone have any ideas?


1 Answer


Well, the missing profiles functionality in Jupyter also puzzled me in the past, albeit for a different reason: I wanted to be able to switch between different deep learning frameworks (Theano & TensorFlow) on demand; eventually I found the solution (described in a blog post of mine here).

The fact is that, although there are no profiles in Jupyter, the startup files functionality for the IPython kernel is still there and, since PySpark employs this particular kernel, it can be used in your case.

So, provided that you already have a working PySpark kernel for Jupyter, all you have to do is write a short initialization script init_spark.py as follows:

from pyspark import SparkConf, SparkContext

# Start a SparkContext in YARN client mode as soon as the kernel starts
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)

and place it in the ~/.ipython/profile_default/startup/ directory of your users.
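
If you are not sure where that directory lives on a particular machine, IPython itself can report it. A quick sketch, assuming IPython >= 4 (where the IPython.paths module is available):

import os
from IPython.paths import get_ipython_dir  # IPython >= 4

# Print the default profile's startup directory,
# typically ~/.ipython/profile_default/startup/
print(os.path.join(get_ipython_dir(), "profile_default", "startup"))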

You can confirm that sc is already set right after starting a Jupyter notebook:

 In [1]: sc
 Out[1]: <pyspark.context.SparkContext at 0x7fcceb7c5fd0>

 In [2]: sc.version
 Out[2]: u'2.0.0'
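
Note also that, if a SparkContext has already been registered in the same process (for example by another startup script), instantiating a second one raises an error. A slightly more defensive variant of init_spark.py is sketched below, assuming Spark 2.x, where the SparkContext.getOrCreate classmethod is available:

from pyspark import SparkConf, SparkContext

# Reuse an already-registered SparkContext if one exists,
# otherwise create a new one in YARN client mode
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext.getOrCreate(conf=conf)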

A more disciplined way of integrating PySpark & Jupyter notebooks is described in my answers here and here.

A third way is to try Apache Toree (formerly Spark Kernel), as described here (I haven't tested it, though).

  • Thanks, this worked really nicely. I'm using the images from https://github.com/jupyter/docker-stacks as a base and already have Toree working for Scala notebooks in yarn-client mode. Re-reading the docs for Toree, it says it supports Python interpreters too, but I'm not sure why I should use Toree over the standard Python kernel for PySpark. We have analysts who know both Scala and Python, and we want to stay flexible. – K2J Apr 21 '17 at 11:00
  • The profiles function of the IPython kernel is still there. It's pretty simple to create a kernel for each IPython profile: just use the `argv` in `kernel.json` to set the profile, e.g. `"argv": [ "python3", "-m", "ipykernel", "--profile", "spark", "-f", "{connection_file}" ]`. I have a few IPython profiles that I want to keep separate for different purposes, and creating a kernel for each was the simplest way for me to achieve that. – AChampion Jun 16 '17 at 23:25