
Google is littered with solutions to this problem, but unfortunately, even after trying out all of them, I am unable to get it working, so please bear with me and see if something strikes you.

OS: macOS

Spark : 1.6.3 (2.10)

Jupyter Notebook : 4.4.0

Python : 2.7

Scala : 2.12.1

I was able to successfully install and run Jupyter Notebook. Next, I tried configuring it to work with Spark, for which I installed the Spark interpreter using Apache Toree. Now when I try running any RDD operation in the notebook, the following error is thrown:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar

Things already tried:

  1. Set PYTHONPATH in .bash_profile (a sketch of these exports follows this list)
  2. Am able to import 'pyspark' in the Python CLI locally
  3. Have tried updating the interpreter's kernel.json to the following:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
  "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
  4. Have even updated the interpreter's run.sh to explicitly load the py4j-0.9-src.zip and pyspark.zip files.

When opening the PySpark notebook and creating the SparkContext, I can see the spark-assembly, py4j and pyspark packages being uploaded from local, but still, when an action is invoked, somehow pyspark is not found.
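For reference, the .bash_profile exports from step 1 look roughly like this (a sketch only; the paths mirror the ones used in kernel.json above):

export SPARK_HOME=/Users/xxxx/Desktop/utils/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH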
– Saurabh Mishra

6 Answers


Use the findspark library to bypass the whole environment setup process. See https://github.com/minrk/findspark for more information.

Use it as below.

import findspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')
from pyspark.sql import SparkSession
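Continuing from the snippet above, a quick sanity check could look like this (a sketch only; SparkSession requires Spark 2.x, so on Spark 1.x create a SparkContext instead, and the builder options here are illustrative):

# must run findspark.init() before any pyspark import, as in the snippet above
spark = SparkSession.builder.master('local[*]').appName('findspark-check').getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())  # prints 45 when the workers can import pyspark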
– kay

I tried the following commands on Windows to link pyspark to Jupyter.

On *nix, use export instead of set (a sketch of the equivalent follows the commands below).

Type the commands below in CMD/Command Prompt:

set PYSPARK_DRIVER_PYTHON=ipython
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark
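On macOS/Linux, the equivalent would look roughly like this in a terminal (a sketch, using the same variables as above):

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark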
– furianpandit

You just need to add:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'

After that, you can work with Pyspark normally.
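Put together, a minimal notebook cell could look like this (a sketch; it assumes pyspark itself is already importable, and the app name is illustrative):

import os

# must be set before pyspark launches the JVM gateway
os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'

from pyspark import SparkContext

sc = SparkContext('local[*]', 'pyspark-shell-check')
print(sc.parallelize([1, 2, 3]).count())  # prints 3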

– Eric Bellet

using:

  • ubuntu 16.04 lts
  • spark-2.2.0-bin-hadoop2.7
  • Anaconda3 4.4.0 (Python 3)

added the following to .bashrc (adjust your SPARK_HOME path accordingly):

export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

then in a terminal window run (adjust path accordingly):

$ /home/gps/spark/spark-2.2.0-bin-hadoop2.7/bin/pyspark 

this will start Jupyter Notebook with pyspark enabled

– Grant Shannon
  • Setting `PYSPARK_DRIVER_PYTHON` to `ipython` or `jupyter` is a really *bad* practice, which can create serious problems downstream (e.g. [when trying `spark-submit`](https://stackoverflow.com/questions/46772280/spark-submit-cant-locate-local-file/46773025#46773025)). – desertnaut Dec 18 '17 at 15:20
  1. Create a virtualenv and install pyspark in it (a sketch of these commands follows this list)
  2. Then set up the kernel

     python -m ipykernel install --user --name your_venv_name --display-name "display_name_in_kernel_list"

  3. Start the notebook

  4. Change the kernel using the dropdown

        Kernel >> Change Kernel >> list of kernels
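A rough sketch of steps 1 and 2 together (command names, paths and the kernel name are illustrative, not from the original answer):

# 1. create a virtualenv and install pyspark plus the kernel machinery in it
python -m venv ~/venvs/pyspark_env          # on Python 2, use: virtualenv ~/venvs/pyspark_env
source ~/venvs/pyspark_env/bin/activate
pip install pyspark ipykernel

# 2. register the virtualenv as a Jupyter kernel
python -m ipykernel install --user --name your_venv_name --display-name "display_name_in_kernel_list"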
    
– iammehrabalam

We create a file startjupyter.sh in the path where we have Jupyter and keep all the environment settings in this file, say as stated above:

export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Also give paths for the error and log files in it; you can also give the port number on which you want to run the notebook. Save the file and execute ./startjupyter.sh. Then check the Jupyter.err file; it will give the token to access the Jupyter Notebook online through a URL.
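What such a startjupyter.sh might look like (a sketch only; the port, log file names and paths are placeholders, not from the original answer):

#!/bin/bash
# environment settings as in the answer above
export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8888'

# redirect output so the access token can be read from Jupyter.err afterwards
pyspark > Jupyter.out 2> Jupyter.err &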