I am running a public Jupyter notebook server, set up as in this tutorial: http://jupyter-notebook.readthedocs.io/en/stable/public_server.html
I want to use pyspark-2.2.1 with this server. I pip-installed py4j and downloaded spark-2.2.1 from the repository.
Locally, I added the following lines to my .bashrc:
export SPARK_HOME='/home/ubuntu/spark-2.2.1-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
and everything works fine when I run python locally.
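For reference, when I run python from a shell that has sourced .bashrc, the import resolves to the downloaded Spark distribution; a quick check along these lines shows it:

import pyspark
print(pyspark.__file__)
# -> /home/ubuntu/spark-2.2.1-bin-hadoop2.7/python/pyspark/__init__.py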
However, on the notebook server I cannot import pyspark, because the exports above are not executed when the Jupyter notebook server starts.
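A quick check in a fresh notebook cell confirms (as far as I can tell) that the relevant variables are simply absent from the server's environment:

import os, sys
print(os.environ.get('SPARK_HOME'))   # None, since .bashrc is not sourced
print(os.environ.get('PYTHONPATH'))   # None as well
print(sys.executable)                 # the interpreter the notebook runs on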
I partially (and inelegantly) worked around the issue by typing
import sys
sys.path.append("/home/ubuntu/spark-2.2.1-bin-hadoop2.7/python")
in the first cell of my notebook. The import then works, but
from pyspark import SparkContext
sc = SparkContext()
myrdd = sc.textFile('exemple.txt')
myrdd.collect() # Everything works fine up to this point
words = myrdd.map(lambda x:x.split())
words.collect()
returns the error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": error=2, No such file or directory
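From what I understand, the executors try to launch a program literally named "python", which apparently cannot be found in the environment the notebook server runs in. I suspect that pointing the workers at the notebook's own interpreter, roughly like this in the first cell (PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON set before the SparkContext is created), is the kind of thing that is needed, but I am not sure this is the proper way to do it:

import os, sys
os.environ['PYSPARK_PYTHON'] = sys.executable          # python used by the workers
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable   # python used by the driver
sys.path.append("/home/ubuntu/spark-2.2.1-bin-hadoop2.7/python")
from pyspark import SparkContext
sc = SparkContext()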
Any idea how I can set the correct paths (either manually or at startup)?
Thanks