I have a machine with Hadoop and Spark installed. Below is my current environment.
python3.6
spark1.5.2
Hadoop 2.7.1.2.3.6.0-3796
I am trying to connect a Jupyter notebook to Spark by setting up an IPython kernel. The following files were written:
/root/.ipython/profile_pyspark/ipython_notebook_config.py
/root/.ipython/profile_pyspark/startup/00-pyspark-setup.py
/root/anaconda3/share/jupyter/kernels/pyspark/kernel.json
kernel.json
{
  "display_name": "PySpark (Spark 2.0.0)",
  "language": "python",
  "argv": [
    "/root/anaconda3/bin/python3",
    "-m",
    "ipykernel",
    "--profile=pyspark"
  ],
  "env": {
    "CAPTURE_STANDARD_OUT": "true",
    "CAPTURE_STANDARD_ERR": "true",
    "SEND_EMPTY_OUTPUT": "false",
    "PYSPARK_PYTHON": "/root/anaconda3/bin/python3",
    "SPARK_HOME": "/usr/hdp/current/spark-client/"
  }
}
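As a side note, I believe the "env" block above can be sanity-checked from inside the notebook with a snippet along these lines (just a sketch, not taken from any guide):

import os

# Confirm that the variables from kernel.json's "env" block reached the kernel
for var in ("SPARK_HOME", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var, "<not set>"))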
00-pyspark-setup.py
import os
import sys

# Point PySpark at the Anaconda interpreter and the HDP Spark client
os.environ["PYSPARK_PYTHON"] = "/root/anaconda3/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

spark_home = os.environ.get('SPARK_HOME', None)

# Make py4j and pyspark importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.8.2.1-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

# shell.py should create the SparkContext and expose it as 'sc'
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())
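In case shell.py is the part that fails, I assume sc could also be created by hand along these lines (a rough sketch; the app name and 'local[*]' master are only placeholders, not my actual cluster settings):

from pyspark import SparkConf, SparkContext

# Minimal hand-built context, relying on the sys.path entries set up above
conf = SparkConf().setAppName("jupyter-pyspark-test").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)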
ipython_notebook_config.py
c = get_config()
c.NotebookApp.port = 80
Then, when I run the following:
jupyter notebook --profile=pyspark
The notebook runs fine. Then I change the kernel to 'PySpark (Spark 2.0.0)', which is supposed to provide the 'sc' Spark context. However, when I type 'sc' in a cell, nothing shows up.
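A quick check along these lines should confirm whether sc is defined at all (a trivial sketch):

# Check whether the shell.py bootstrap actually created sc in this kernel
try:
    sc
    print("sc is defined:", sc)
except NameError:
    print("sc is not defined in this kernel")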
So, since sc cannot be initialized, running the following fails:
nums = sc.parallelize(range(1000000))
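For reference, once sc is available I would expect a simple action like the following to succeed (just the minimal check I have in mind):

# A count action should return 1000000 if the context and executors are healthy
print(nums.count())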
Can anybody help me configure Jupyter Notebook to talk to Spark?