I've been searching everywhere for a way to develop code locally (Ubuntu 16.04 in my case) in the IPython console of the Spyder IDE (which came with Anaconda) while having the processing done on a cluster (created on Azure HDInsight, for instance). I can run PySpark locally with no problems (both through the spark-shell and through Spyder), but I would like to know whether it is possible to run the code on a Spark/YARN (?) cluster for faster processing and have the results show up in the IPython console in Spyder. I found this post on Stack Overflow ( Running PySpark on and IDE like Spyder? ) with something that looked like it could solve the issue, but I get an error instead. The error appears both when I launch Spyder "normally" and when I launch it with the "spark-submit spyder.py" command:
sc = SparkContext(conf=conf)
Traceback (most recent call last):
  File "<ipython-input-3-6b825dbb354c>", line 1, in <module>
    sc = SparkContext(conf=conf)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 172, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 1062, in __call__
    answer = self._gateway_client.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
    response = self.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
    connection = self._get_connection()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
    connection = self._create_connection()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
    connection.start()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
    raise Py4JNetworkError(msg, e)
Py4JNetworkError: An error occurred while trying to connect to the Java server
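Since the Py4JNetworkError only says that the Python side could not reach the Java gateway, I tried ruling out basic networking first. Here is a small sketch of the check I ran (the host and port below are placeholders for the redacted values in my code); it simply tests whether the master address accepts TCP connections from my machine at all:

import socket

# Placeholder values; substitute the actual master host and port
MASTER_HOST = "xx.x.x.xx"   # redacted, same as in the code below
MASTER_PORT = 7077          # 7077 is the usual standalone-master default

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect((MASTER_HOST, MASTER_PORT))
    print("Master port is reachable")
except socket.error as e:
    print("Cannot reach the master: %s" % e)
finally:
    s.close()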
This is my code:
import os
import sys

# Point PySpark at the local Java and Spark installations
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-oracle/"
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
os.environ["PYSPARK_PYTHON"] = "python2.7"

# Make the bundled py4j and pyspark packages importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
############################################################################
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Master IP and port redacted (not sure I retrieved them from the right place)
conf = SparkConf().setMaster('spark://xx.x.x.xx:xxxxx').setAppName("building a warehouse")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Small example DataFrame of (label, sentence) pairs
sentenceData = sqlCtx.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

# Split the sentences into words, hash them into term-frequency vectors,
# then rescale by inverse document frequency
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
for features_label in rescaledData.select("features", "label").take(3):
    print(features_label)
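From what I've read, HDInsight runs Spark on YARN rather than as a standalone cluster, so I'm not even sure setMaster('spark://...') is the right mode here. If I understand correctly, a YARN client-mode connection would look roughly like the sketch below, assuming the cluster's Hadoop/YARN config files were copied to my local machine and HADOOP_CONF_DIR points at them (the path is hypothetical):

import os
from pyspark import SparkConf, SparkContext

# Assumption: yarn-site.xml and core-site.xml were copied from the
# cluster into this local directory (hypothetical path)
os.environ["HADOOP_CONF_DIR"] = "/home/me/hdinsight-conf"

# 'yarn-client' runs the driver locally and the executors on the cluster
conf = SparkConf().setMaster("yarn-client").setAppName("building a warehouse")
sc = SparkContext(conf=conf)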
I created the cluster on Azure HDInsight, and I'm not sure whether I retrieved the IP and port from the right places, or whether an SSH tunnel has to be created first. It is all very confusing.
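Alternatively, I've seen it mentioned that HDInsight exposes a Livy REST endpoint at https://<clustername>.azurehdinsight.net/livy, which would sidestep the tunnel question entirely by submitting the job over HTTPS. A rough sketch of what I think that would look like (cluster name, credentials, and the wasb:// path are all placeholders, and the script would have to be uploaded to the cluster's storage first):

import json
import requests

# All values below are placeholders for my actual cluster
LIVY_URL = "https://mycluster.azurehdinsight.net/livy/batches"
AUTH = ("admin", "password")  # HDInsight cluster login

# Submit a script that already sits in the cluster's default storage
payload = {"file": "wasb:///example/warehouse_job.py"}
headers = {"Content-Type": "application/json", "X-Requested-By": "admin"}

r = requests.post(LIVY_URL, data=json.dumps(payload), headers=headers, auth=AUTH)
print(r.status_code)
print(r.json())  # batch id and state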
Hope someone can help me. Thanks in advance!