I am having trouble understanding the difference between client mode and cluster mode. Let's take an example:

I have test.py with the following:

import time

from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext

if __name__ == "__main__":

    conf = (SparkConf()
            .setAppName(appName)  # appName is defined elsewhere
            .set("spark.executor.memory", ?)
            .set("spark.driver.memory", ?)
            .set("spark.executor.memoryOverhead", ?)
            .set("spark.network.timeout", ?)
            .set("spark.files.overwrite", ?)
            .set("spark.executor.heartbeatInterval", ?)
            .set("spark.driver.maxResultSize", ?)
            .set("spark.executor.instances", ?)
            .set("spark.executor.cores", ?)
            .set("spark.driver.cores", ?)
            .set("spark.sql.shuffle.partitions", ?)
            )
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    start_time = time.time()
    sc = spark.sparkContext
    sqlContext = SQLContext(sparkContext=sc)

I'm working on a Linux server over SSH. To run test.py, I have two options:

1- Reserve a node using the following command:

salloc --time=03:00:00 --cpus-per-task=32 --mem=0 --account=def-myName

This command allows me to reserve a node for three hours. The node has the following specifications:

Cores: 32
Available memory: 125 GB
CPU type: 2 x Intel E5-2683 v4 "Broadwell" @ 2.1 GHz
Storage: 2 x 480 GB SSD

Now, to run test.py, I just type spark-submit test.py. Is this called client mode or cluster mode? If it is client mode, how can I set:

Master Memory:
Master Cores:
Number of Worker Nodes:
Memory per worker node (gb):
Cores per worker node:

2- I can run a job.sh which is defined as follows:

    #SBATCH --nodes=1
    #SBATCH --time=
    #SBATCH --mem=128000M
    #SBATCH --cpus-per-task=
    #SBATCH --ntasks-per-node=
    #SBATCH --output=sparkjob-%j.out
    #SBATCH --mail-type=ALL
    #SBATCH --error=
    ## send mail to this address
    #SBATCH --mail-user=

    spark-submit --total-executor-cores xxx --driver-memory xxxx test.py
....

Then I execute the code with sbatch job.sh. Is this called cluster mode?

moudi
2 Answers

In client mode the driver (which executes your tasks) runs on the server from which you ran spark-submit. The executors are allocated dynamically by your resource manager (YARN or Mesos) on any nodes that have the resources you requested.

In cluster mode the driver is also allocated dynamically by your resource manager and can therefore be on any node of your cluster.
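
For example, the same application can be submitted in either mode just by switching the flag (a minimal sketch, assuming a YARN cluster; the resource values are placeholders):

# driver runs on the machine where you type this command
spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-memory 4G test.py

# driver runs inside the cluster, on a node chosen by the resource manager
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 4G test.py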

You can read more here https://stackoverflow.com/a/41142747/8467558.

The spark-submit command-line option for the deploy mode is --deploy-mode:

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).

If it isn't set, it defaults to the spark.submit.deployMode value from your spark-defaults.conf configuration. If there is no default configuration, or if this value isn't set, it will be client.
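
For reference, making cluster mode the default could look like this in spark-defaults.conf (a sketch; the file usually sits under $SPARK_HOME/conf, and the yarn master is an assumption about your setup):

# $SPARK_HOME/conf/spark-defaults.conf
spark.master             yarn
spark.submit.deployMode  cluster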

MaFF

Some supplementary information on when to use one option or the other.

As already mentioned, when you run spark-submit in client mode the driver runs on the machine from which you executed the spark-submit command. That also means you can monitor the execution of your job from that same machine through the command line. Consequently, if you terminate your command-line session you terminate the driver and, with it, the Spark job. So you should not use client mode in production.

In cluster mode the driver runs on an arbitrary node of the cluster. This implies that you need another way to monitor your Spark job, i.e. the Spark UI.
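
If the cluster is managed by YARN, for instance, you can also follow a cluster-mode job from the command line (a sketch assuming YARN; the application id is printed when the job is submitted):

# check the current state of the application
yarn application -status <applicationId>

# fetch the aggregated logs once the application has finished
yarn logs -applicationId <applicationId>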

As you might have already guessed, client mode is useful for testing your jobs from your local machine, while cluster mode is used in production and/or testing environments.

And to answer your questions:

1) The default mode is client mode, so when you type:

spark-submit --total-executor-cores xxx --driver-memory xxxx test.py

This will be executed in client mode.

2) If you want to execute your job in cluster mode, you must type:

spark-submit --total-executor-cores xxx --driver-memory xxxx --deploy-mode cluster test.py
abiratsis