
What is the standard development process, involving some kind of IDE, for Spark with Python for

  1. Data exploration on the cluster
  2. Application development?

I found the following answers, which do not satisfy me:

a) Zeppelin/Jupyter notebooks running "on the cluster"

b)

I would love to do a) and b) using some locally installed IDE that communicates with the cluster directly, because I dislike the idea of creating local dummy files and changing the code before running it on the cluster. I would also prefer an IDE over a notebook. Is there a standard way to do this, or are my answers above already "best practice"?


1 Answer


You should be able to use any IDE with PySpark. Here are some instructions for Eclipse and PyDev:

  • set the HADOOP_HOME variable to the location of winutils.exe
  • set the SPARK_HOME variable to your local Spark folder
  • set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults and log4j)
  • add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
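
If the IDE makes it awkward to configure these per interpreter, one workaround is to export them at the top of the script, before anything from pyspark is imported. This is only a minimal sketch mirroring the steps above; the paths are placeholders you would replace with your own locations:

import glob
import os
import sys

# Placeholder paths - replace with the locations on your machine
os.environ["HADOOP_HOME"] = "C:/hadoop"                  # folder containing bin/winutils.exe
os.environ["SPARK_HOME"] = "C:/spark"                    # local Spark installation
os.environ["SPARK_CONF_DIR"] = "C:/spark/cluster-conf"   # copied spark-defaults and log4j config

# Make the bundled PySpark and Py4J importable, mirroring the PYTHONPATH entries above
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python", "lib", "pyspark.zip"))
sys.path.append(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])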

For testing purposes you can add code like:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()

With the proper configuration file in SPARK_CONF_DIR, it should work with just SparkSession.builder.getOrCreate(). Alternatively, you could set up your run configurations to use spark-submit directly. There are also websites with similar instructions for other IDEs.
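
As a rough illustration of the SPARK_CONF_DIR approach (the property names are standard Spark settings, but the values, file contents, and app name below are placeholders, not something prescribed by this answer), the script then carries no hard-coded master URL:

# Assuming SPARK_CONF_DIR contains a spark-defaults.conf along the lines of:
#   spark.master           spark://my-cluster-master-node:7077
#   spark.executor.memory  4g
from pyspark.sql import SparkSession

# No master URL in the code - it is picked up from the config directory,
# so the same file can also be handed to spark-submit unchanged.
spark = SparkSession.builder.appName("ide-dev-session").getOrCreate()
spark.range(10).show()   # quick smoke test against the cluster
spark.stop()

Running the same file through spark-submit should pick up the same spark-defaults.conf, so the IDE run configuration and the command-line submission stay consistent.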
