
Google is littered with solutions to this problem, but unfortunately, even after trying out all of them, I am unable to get it working, so please bear with me and see if something strikes you.

OS: macOS

Spark : 1.6.3 (2.10)

Jupyter Notebook : 4.4.0

Python : 2.7

Scala : 2.12.1

I was able to successfully install and run Jupyter Notebook. Next, I tried configuring it to work with Spark, for which I installed the Spark interpreter using Apache Toree. Now when I try running any RDD operation in the notebook, the following error is thrown:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar

Things already tried:

  1. Set PYTHONPATH in .bash_profile (a sketch of these exports follows this list)
  2. Am able to import 'pyspark' in the Python CLI locally
  3. Have tried updating the interpreter's kernel.json to the following:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
  "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
  4. Have even updated the interpreter's run.sh to explicitly load the py4j-0.9-src.zip and pyspark.zip files.

When opening the PySpark notebook and creating the SparkContext, I can see the spark-assembly, py4j and pyspark packages being uploaded from local, but still, when an action is invoked, somehow pyspark is not found.
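For reference, the .bash_profile exports from step 1 look roughly like this (a sketch only; the paths mirror the ones used in kernel.json above):

export SPARK_HOME=/Users/xxxx/Desktop/utils/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH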
– Saurabh Mishra

6 Answers


Use the findspark library to bypass the whole environment setup process. See https://github.com/minrk/findspark for more information.

Use it as below.

import findspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')
from pyspark.sql import SparkSession
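Continuing from the snippet above, a quick sanity check could look like this (a sketch only; SparkSession requires Spark 2.x, so on Spark 1.x create a SparkContext instead, and the builder options here are illustrative):

# must run findspark.init() before any pyspark import, as in the snippet above
spark = SparkSession.builder.master('local[*]').appName('findspark-check').getOrCreate()
print(spark.sparkContext.parallelize(range(10)).sum())  # prints 45 when the workers can import pyspark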
– kay

I tried the following commands on Windows to link pyspark to Jupyter.

On *nix, use export instead of set (a sketch of the equivalent follows the commands below).

Type the commands below in CMD/Command Prompt:

set PYSPARK_DRIVER_PYTHON=ipython
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark
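On macOS/Linux, the equivalent would look roughly like this in a terminal (a sketch, using the same variables as above):

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark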
– furianpandit

You just need to add:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'

After that, you can work with Pyspark normally.
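Put together, a minimal notebook cell could look like this (a sketch; it assumes pyspark itself is already importable, and the app name is illustrative):

import os

# must be set before pyspark launches the JVM gateway
os.environ['PYSPARK_SUBMIT_ARGS'] = 'pyspark-shell'

from pyspark import SparkContext

sc = SparkContext('local[*]', 'pyspark-shell-check')
print(sc.parallelize([1, 2, 3]).count())  # prints 3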

– Eric Bellet

using:

  • ubuntu 16.04 lts
  • spark-2.2.0-bin-hadoop2.7
  • Anaconda3 4.4.0 (Python 3)

added the following to .bashrc (adjust your SPARK_HOME path accordingly):

export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

then in a terminal window run (adjust path accordingly):

$ /home/gps/spark/spark-2.2.0-bin-hadoop2.7/bin/pyspark 

this will start Jupyter Notebook with pyspark enabled

– Grant Shannon
  • Setting `PYSPARK_DRIVER_PYTHON` to `ipython` or `jupyter` is a really *bad* practice, which can create serious problems downstream (e.g. [when trying `spark-submit`](https://stackoverflow.com/questions/46772280/spark-submit-cant-locate-local-file/46773025#46773025)). – desertnaut Dec 18 '17 at 15:20
  1. Create a virtualenv and install pyspark in it (a sketch of these commands follows this list)
  2. Then set up the kernel

     python -m ipykernel install --user --name your_venv_name --display-name "display_name_in_kernel_list"

  3. Start the notebook

  4. Change the kernel using the dropdown

        Kernel >> Change Kernel >> list of kernels
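A rough sketch of steps 1 and 2 together (command names, paths and the kernel name are illustrative, not from the original answer):

# 1. create a virtualenv and install pyspark plus the kernel machinery in it
python -m venv ~/venvs/pyspark_env          # on Python 2, use: virtualenv ~/venvs/pyspark_env
source ~/venvs/pyspark_env/bin/activate
pip install pyspark ipykernel

# 2. register the virtualenv as a Jupyter kernel
python -m ipykernel install --user --name your_venv_name --display-name "display_name_in_kernel_list"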
    
– iammehrabalam

We create a file startjupyter.sh in the path where we have Jupyter and keep all the environment settings in this file, say as stated above:

export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Also give paths for the error and log files in it; you can also give the port number on which you want to run the notebook. Save the file and execute ./startjupyter.sh. Then check the Jupyter.err file; it will give the token to access the Jupyter Notebook online through a URL.
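What such a startjupyter.sh might look like (a sketch only; the port, log file names and paths are placeholders, not from the original answer):

#!/bin/bash
# environment settings as in the answer above
export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8888'

# redirect output so the access token can be read from Jupyter.err afterwards
pyspark > Jupyter.out 2> Jupyter.err &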