
I've been searching all over for how to develop code on a local machine (Ubuntu 16.04 in my case) using the IPython console in the Spyder IDE (which came with Anaconda) and have the processing done on a cluster (created on Azure HDInsight, for instance). I can run PySpark locally with no problems (both through the spark-shell and through Spyder), but I would like to know whether it is possible to run the code on a Spark/YARN (?) cluster for faster processing and have the results show up in the IPython console in Spyder. I found this post here on Stack Overflow ( Running PySpark on and IDE like Spyder? ) with something that feels like it could solve the issue, but I get an error. I get the error both when I launch Spyder "normally" and when launching it with the "spark-submit spyder.py" command:

 sc = SparkContext(conf=conf)
Traceback (most recent call last):

  File "<ipython-input-3-6b825dbb354c>", line 1, in <module>
    sc = SparkContext(conf=conf)

  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)

  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 172, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)

  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 1062, in __call__
    answer = self._gateway_client.send_command(command)

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
    response = self.send_command(command)

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
    connection = self._get_connection()

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
    connection = self._create_connection()

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
    connection.start()

  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
    raise Py4JNetworkError(msg, e)

Py4JNetworkError: An error occurred while trying to connect to the Java server
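
For reference, running locally does work: a minimal local-mode session along these lines (just a sketch, assuming the same environment setup as in the full listing below) starts fine and prints a result in the IPython console, so the problem only appears with the remote master:

from pyspark import SparkConf, SparkContext

# Local-mode sanity check: uses all local cores, no cluster involved
conf = SparkConf().setMaster("local[*]").setAppName("local sanity check")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())  # prints 4950 when the context starts correctly
sc.stop()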

This is my code:

import os
import sys

# Point the driver at the local Java and Spark installations
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-oracle/"
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
os.environ["PYSPARK_PYTHON"] = "python2.7"

# Make the bundled py4j and pyspark packages importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
############################################################################

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Master URL of the remote cluster (IP and port redacted here); this is the part I am unsure about
conf = SparkConf().setMaster('spark://xx.x.x.xx:xxxxx').setAppName("building a warehouse")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = sqlCtx.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
for features_label in rescaledData.select("features", "label").take(3):
    print(features_label)

I created the cluster on Azure HDInsight, and I am not sure whether I retrieved the IP and port from the right places, or whether an SSH tunnel has to be created. It is all very confusing.
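
For illustration, a quick reachability check along these lines (the host and port here are just placeholders for the redacted values in setMaster above) would at least tell me whether the master address is reachable from my machine before Spark gets involved:

import socket

# Placeholder host/port: substitute the actual master address used in setMaster()
# (7077 is the default standalone master port; an HDInsight/YARN cluster may differ)
master_host = "xx.x.x.xx"
master_port = 7077

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect((master_host, master_port))
    print("Master port is reachable")
except socket.error as e:
    print("Cannot reach master: %s" % e)
finally:
    sock.close()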

Hope someone can help me. Thanks in advance!

  • Did you ever find an answer to this? – Thomas May 09 '18 at 09:56
  • Not really, no; I never used Spyder again after this. But you can easily use Spark, or any other library, from inside a cluster with a Jupyter Notebook: create the cluster, install Jupyter and the packages you need on it, launch it, and the notebook will be accessible in your browser via the cluster's IP (assuming you configured port forwarding on the cluster correctly). – Hugo Veiga May 10 '18 at 10:42
