5

The documentation on spark-submit says the following:

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.

Regarding the pyspark it says the following:

You can also use bin/pyspark to launch an interactive Python shell.

This question may sound stupid, but when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?

Denys
    Possible duplicate of [What is the difference between spark-submit and pyspark?](http://stackoverflow.com/questions/26726780/what-is-the-difference-between-spark-submit-and-pyspark) – avrsanjay Sep 26 '16 at 14:06

4 Answers

9

There is no practical difference between these two. If not configured otherwise, both will execute code in local mode. If a master is configured (either via the --master command-line parameter or the spark.master configuration property), the corresponding cluster will be used to execute the program.
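
For illustration, here is a minimal sketch (assuming PySpark is installed; the standalone master URL in the comments is only a placeholder) showing that the same application code simply picks up whatever master the launcher was given:

```python
from pyspark.sql import SparkSession

# The same code runs locally or on a cluster depending on the master setting.
# Typically you leave the master out of the code and let the launcher choose, e.g.:
#   pyspark --master local[*]                          # interactive shell, local mode
#   spark-submit --master spark://host:7077 my_app.py  # standalone cluster (placeholder URL)
spark = SparkSession.builder.appName("master-demo").getOrCreate()

# Report which master this session actually connected to.
print(spark.sparkContext.master)

spark.stop()
```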

zero323
7

If you are using EMR, there are three ways of running a Spark application:

  1. using pyspark (or spark-shell)
  2. using spark-submit without --master and --deploy-mode
  3. using spark-submit with --master and --deploy-mode

Although all three of the above will run the application on the Spark cluster, there is a difference in how the driver program works.

  • In the 1st and 2nd, the driver runs in client mode, whereas in the 3rd the driver also runs inside the cluster.
  • In the 1st and 2nd, you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel (see the sketch after this list).
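
A quick way to check which mode a job actually ended up in is to read the submit configuration from inside the application; a minimal sketch (spark.submit.deployMode and spark.master are standard Spark configuration keys):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mode-check").getOrCreate()
conf = spark.sparkContext.getConf()

# "client" for cases 1 and 2 above; "cluster" when launched with
# spark-submit --deploy-mode cluster (case 3).
print("deploy mode:", conf.get("spark.submit.deployMode", "client"))
print("master:     ", conf.get("spark.master", "not set"))

spark.stop()
```
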
braj
4

Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):

..when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?

As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.

As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but the driver is a different machine from the one that you call spark-submit from.
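
To make the driver/executor split concrete, here is a small sketch (assuming a working SparkSession; the numbers are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-cluster").getOrCreate()

# Plain Python: runs only in the driver process.
sizes = [1, 2, 3, 4]
print("driver-side sum:", sum(sizes))

# PySpark API: the lambda below is serialized and shipped to the executors,
# so it runs across the cluster (or local threads in local mode).
rdd = spark.sparkContext.parallelize(sizes, numSlices=2)
print("cluster-side sum of squares:", rdd.map(lambda x: x * x).sum())

spark.stop()
```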

timchap
1
  1. PySpark has significant differences compared to Scala Spark and Java Spark; for instance, Python Spark only supports YARN for scheduling on a cluster.

  2. If you are running Python Spark on a local machine, you can use pyspark. If on a cluster, use spark-submit.

  3. If you have any dependencies in your Python Spark job, you need to package them into a zip file for submission (as shown below).
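
As an illustration of point 3, here is a minimal sketch (the archive name deps.zip is hypothetical); the same archive could equally be passed on the command line via spark-submit --py-files deps.zip:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deps-demo").getOrCreate()

# Ship a zip of pure-Python dependencies to every executor so that imports
# inside map/UDF functions can resolve them. "deps.zip" is a hypothetical
# archive containing your helper modules or packages.
spark.sparkContext.addPyFile("deps.zip")

# Tasks running on the executors can now import modules packaged in deps.zip.
spark.stop()
```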

SharpLu