
The common way of running a Spark job appears to be using spark-submit, as below (source):

spark-submit --py-files pyfile.py,zipfile.zip main.py --arg1 val1

Being new to Spark, I wanted to know why this first method is preferred over running it from python (example):

python pyfile-that-uses-pyspark.py
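
For concreteness, here is a minimal sketch of the kind of script I mean (the file name and contents are just an illustration, not an actual job):

    # pyfile-that-uses-pyspark.py -- minimal illustrative example
    from pyspark.sql import SparkSession

    # When launched with plain `python`, this builder is what ends up starting
    # Spark's JVM behind the scenes; under spark-submit, the same builder also
    # picks up --master, --conf, etc. from the command line.
    spark = SparkSession.builder.appName("example").getOrCreate()

    df = spark.range(10)      # tiny DataFrame just to have something to run
    print(df.count())

    spark.stop()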

The former method yields many more examples when googling the topic, but no explicitly stated reasons for it. In fact, here is another Stack Overflow question where one answer, repeated below, specifically tells the OP not to use the python method but does not give a reason why.

dont run your py file as: python filename.py instead use: spark-submit filename.py

Can someone provide insight?

Mint
  • No, but I developed some understanding on my own, although I'd have to look into it again to remember. If I recollect, it has to do with the cluster, configuring the cluster, and providing all the needed python packages to the nodes. With spark-submit it's easy to do all those things, and the python package dependencies can be submitted as a zip. – Mint Aug 29 '19 at 20:44
  • Yes, I should come back and provide an answer for others, because it is a natural question to ask when first approaching Spark. – Mint Aug 30 '19 at 20:33

2 Answers


@mint Your comment is more or less correct.

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.

As I understand it, using python pyfile-that-uses-pyspark.py cannot launch an application on a cluster, or at least it is more difficult to do so.
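
As a rough sketch (the values below are made up, not from the question): with spark-submit the cluster manager and resources are supplied on the command line, for example spark-submit --master yarn --executor-memory 2g main.py, while with plain python those details have to be baked into the script itself:

    # What plain `python` forces you to hard-code (hypothetical values):
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("example")
             # with spark-submit this would come from --master (yarn, k8s://..., spark://...)
             .master("spark://master-host:7077")
             # with spark-submit this would come from --executor-memory 2g
             .config("spark.executor.memory", "2g")
             .getOrCreate())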

RNHTTR

The slightly longer answer (beyond noting that the linked Anaconda docs are wrong, and that the official documentation never tells you to use python) is that Spark requires a JVM.

spark-submit is a wrapper around a JVM process that sets up the classpath, downloads packages, verifies configuration, among other things. Running python bypasses this; all of that would have to be rebuilt into pyspark/__init__.py so that those steps run when the module is imported.
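
For example (a sketch; the --packages coordinate is just an illustration): when a plain python process creates a session, pyspark launches the JVM itself, and as far as I know the PYSPARK_SUBMIT_ARGS environment variable is the way to pass submit-style options into that launch:

    import os

    # Must be set before the first SparkSession/SparkContext is created;
    # the trailing "pyspark-shell" token is required. This stands in for
    # what you would normally pass as spark-submit flags.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages org.apache.spark:spark-avro_2.12:3.3.0 pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()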

OneCricketeer
  • Thanks, but then how does it work when using Jupyter notebooks? For example here https://codelabs.developers.google.com/codelabs/spark-jupyter-dataproc#5 – Michael Dec 08 '22 at 08:08
  • @Michael That would depend on the Jupyter kernelspec. It needs to be defined with `PYTHONPATH` set to include the pyspark and py4j libraries. Or you can have it run a shell process that invokes `pyspark` directly. Refer to the Apache Toree project as an example. – OneCricketeer Dec 08 '22 at 17:11
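
One common alternative to hand-editing the kernelspec (not mentioned in the comments above, just a typical setup) is the third-party findspark package, which locates SPARK_HOME and puts the pyspark and py4j libraries on sys.path at runtime:

    # pip install findspark   (third-party helper, not part of Spark itself)
    import findspark
    findspark.init()   # finds SPARK_HOME and adds pyspark/py4j to sys.path

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("notebook").getOrCreate()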