
What is the standard development process, involving some kind of IDE, for Spark with Python for

  1. Data exploration on the cluster
  2. Application development?

I found the following answers, which do not satisfy me:

a) Zeppelin/Jupyter notebooks running "on the cluster"

b)

I would love to do a) and b) using some locally installed IDE that communicates with the cluster directly, because I dislike the idea of creating local dummy files and changing the code before running it on the cluster. I would also prefer an IDE over a notebook. Is there a standard way to do this, or are my answers above already "best practice"?


1 Answer


You should be able to use any IDE with PySpark. Here are some instructions for Eclipse and PyDev:

  • set the HADOOP_HOME variable to the location of winutils.exe
  • set the SPARK_HOME variable to your local Spark folder
  • set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults and log4j)
  • add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
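
If the IDE makes it awkward to configure these per interpreter, one workaround is to export them at the top of the script, before anything from pyspark is imported. This is only a minimal sketch mirroring the steps above; the paths are placeholders you would replace with your own locations:

import glob
import os
import sys

# Placeholder paths - replace with the locations on your machine
os.environ["HADOOP_HOME"] = "C:/hadoop"                  # folder containing bin/winutils.exe
os.environ["SPARK_HOME"] = "C:/spark"                    # local Spark installation
os.environ["SPARK_CONF_DIR"] = "C:/spark/cluster-conf"   # copied spark-defaults and log4j config

# Make the bundled PySpark and Py4J importable, mirroring the PYTHONPATH entries above
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python", "lib", "pyspark.zip"))
sys.path.append(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])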

For testing purposes you can add code like:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()

With the proper configuration file in SPARK_CONF_DIR, it should work with just SparkSession.builder.getOrCreate(). Alternatively, you could set up your run configurations to use spark-submit directly. There are also websites with similar instructions for other IDEs.
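
As a rough illustration of the SPARK_CONF_DIR approach (the property names are standard Spark settings, but the values, file contents, and app name below are placeholders, not something prescribed by this answer), the script then carries no hard-coded master URL:

# Assuming SPARK_CONF_DIR contains a spark-defaults.conf along the lines of:
#   spark.master           spark://my-cluster-master-node:7077
#   spark.executor.memory  4g
from pyspark.sql import SparkSession

# No master URL in the code - it is picked up from the config directory,
# so the same file can also be handed to spark-submit unchanged.
spark = SparkSession.builder.appName("ide-dev-session").getOrCreate()
spark.range(10).show()   # quick smoke test against the cluster
spark.stop()

Running the same file through spark-submit should pick up the same spark-defaults.conf, so the IDE run configuration and the command-line submission stay consistent.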
