
I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:

PATH="/my/path/to/anaconda3/bin:$PATH"

export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"

export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"

When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine from my command-line shell, and the sc output is not empty. It seems to work fine.

When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py: 

And sc in my Jupyter Notebook is empty.

Can anyone help solve this situation?


Just to clarify: there is nothing after the colon at the end of the error. I also tried to create my own start-up file following this post, which I quote here so you don't have to go look there:

I created a short initialization script init_spark.py as follows:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)

and placed it in the ~/.ipython/profile_default/startup/ directory

When I did this, the error then became:

[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
  • What is the rest of the warning text? There's a colon at the end of that line, is there anything that comes after it? – darthbith Dec 15 '17 at 00:51
  • No! It's empty. There is nothing after the colon. – puifais Dec 15 '17 at 01:05
  • What if you delete that `alias` line, or try deleting some of the options from it? Does the error change? – darthbith Dec 15 '17 at 01:57
  • Just tried removing the alias. No difference. Still the same error :( – puifais Dec 15 '17 at 17:16
  • Here's a related link which could possibly help https://stackoverflow.com/questions/33908156/how-to-load-jar-dependenices-in-ipython-notebook. Adding pyspark-shell to PYSPARK_SUBMIT_ARGS is the key. – Kobe-Wan Kenobi Jan 25 '18 at 12:24

3 Answers


Well, it really pains me to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and now tend to become standard practice, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(

(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).

At the time of writing (Dec 2017), there is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.

The first thing to do is run a jupyter kernelspec list command, to get the list of kernels already available on your machine; here is the result in my case (Ubuntu):

$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow

The first kernel, python2, is the "default" one coming with IPython (it may well be the only one present in your system); as for the rest, I have two more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.

The entries in the list above are directories, and each one contains a single file named kernel.json. Let's see the contents of this file for my pyspark2 kernel:

{
 "display_name": "PySpark (Spark 2.0)",
 "language": "python",
 "argv": [
  "/opt/intel/intelpython27/bin/python2",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
  "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
 }
}

I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).

Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should show up if you run jupyter kernelspec list again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):

However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.

If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
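
To make this concrete, here is a rough sketch of what such a kernel could look like adapted to the placeholder paths from your question (Anaconda Python 3, Spark 2.1.0). The folder name pyspark21, the per-user kernel location (~/.local/share/jupyter/kernels on Linux, ~/Library/Jupyter/kernels on macOS - run jupyter --paths if unsure), and the py4j zip version are assumptions on my part; check the exact py4j-*.zip name under $SPARK_HOME/python/lib and adjust:

mkdir -p ~/.local/share/jupyter/kernels/pyspark21
cat > ~/.local/share/jupyter/kernels/pyspark21/kernel.json <<'EOF'
{
 "display_name": "PySpark (Spark 2.1.0)",
 "language": "python",
 "argv": [
  "/my/path/to/anaconda3/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/my/path/to/spark-2.1.0-bin-hadoop2.7",
  "PYTHONPATH": "/my/path/to/spark-2.1.0-bin-hadoop2.7/python:/my/path/to/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
  "PYTHONSTARTUP": "/my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/my/path/to/anaconda3/bin/python"
 }
}
EOF

After that, jupyter kernelspec list should show the new kernel, and selecting it when creating a notebook should give you sc and spark preloaded by the PYTHONSTARTUP script.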

Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
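
For illustration, a minimal sketch of the trimmed-down Spark section of your profile, keeping the placeholder paths from your question:

export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
# no PYSPARK_DRIVER_PYTHON, PYSPARK_DRIVER_PYTHON_OPTS, or pyspark alias any more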

UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:

"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
desertnaut
  • One option to make all this "easier" is to use the Apache Toree project – OneCricketeer Dec 18 '17 at 14:49
  • @desertnaut I followed your example and set up a pyspark kernel with `Spark 2.2.1` and `Python 3.6`. Can you advise me how to specify the pyspark kernel when starting a Jupyter notebook from the terminal? – Khurram Majeed Jan 21 '18 at 14:32
  • @KhurramMajeed no need to specify anything from the command line; after running `jupyter notebook` and getting to the Notebook dashboard, when selecting New, you get a pull-down menu of all the existing kernels, where you can specify which one to use (kernels are displayed with their respective `display_name` field from the `kernel.json` file shown above). See the [example here](http://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb) – desertnaut Jan 21 '18 at 15:39
  • @cricket_007 as of Toree 0.3.0, support for PySpark (and SparkR) kernels has been discontinued with the following GitHub commit: `[TOREE-487][TOREE-488] Remove PySpark and SparkR interpreters Instead, please use a supported kernel such IPython or IRKernel`. This post is a lifesaver. – alonso s Mar 03 '19 at 08:01
  • This was really really helpful. The only thing I would add is that `locate spark` can be used to identify the right paths. That took me some time, but once I was able to find the right paths, I matched them with what @desertnaut said. Thank you so much for this! – Navaneethan Santhanam Jul 06 '19 at 06:28

Conda can help correctly manage a lot of dependencies...

Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

Create a conda environment with all needed dependencies apart from Spark:

conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0

Activate the environment:

$ source activate findspark-jupyter-openjdk8-py3

Launch a Jupyter Notebook server:

$ jupyter notebook

In your browser, create a new Python3 notebook

Try calculating Pi with the following script (borrowed from this):

import findspark
findspark.init()   # locates Spark via SPARK_HOME and makes the pyspark package importable

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Monte Carlo check: p is unused; draw a random point in the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
  • I've set up all three methods mentioned here and have the benefit of opting for any method I feel like with no conflicts whatsoever... at least for now. One note: I used a bash function instead of hard-coding the environmental variables. – Serzhan Akhmetov Mar 14 '18 at 08:50

I just conda installed sparkmagic (after re-installing a newer version of Spark).

I think that alone simply works, and it is much simpler than fiddling with configuration files by hand.
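
For reference, a minimal sketch of that route, assuming the sparkmagic package published on conda-forge (note that sparkmagic talks to Spark through a Livy server rather than a local installation, so you also need a Livy endpoint to point it at):

conda install -c conda-forge sparkmagic
# the PySpark kernel is then registered from sparkmagic's install directory,
# e.g. (the directory comes from `pip show sparkmagic`):
#   jupyter-kernelspec install <sparkmagic-location>/sparkmagic/kernels/pysparkkernel --user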

matanster