
I use Spark 1.3.0 on a cluster of 5 worker nodes, each with 36 cores and 58 GB of memory. I'd like to configure Spark's standalone cluster with multiple executors per worker.

I have seen the merged SPARK-1706, but it is not immediately clear how to actually configure multiple executors.

Here is the latest configuration of the cluster:

spark.executor.cores = "15"
spark.executor.instances = "10"
spark.executor.memory = "10g"

These settings are set on a SparkContext when the Spark application is submitted to the cluster.
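
For reference, a minimal sketch of how these properties end up on the SparkConf behind the SparkContext (the master URL and application name below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .setAppName("my-app")                    // placeholder application name
  .set("spark.executor.cores", "15")       // cores per executor
  .set("spark.executor.instances", "10")   // the executor count this question is trying to achieve
  .set("spark.executor.memory", "10g")     // memory per executor

val sc = new SparkContext(conf)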

asked by Rich; edited by Jacek Laskowski

4 Answers


You first need to configure your Spark standalone cluster, then set the amount of resources needed for each individual Spark application you want to run.

In order to configure the cluster, you can try this:

  • In conf/spark-env.sh:

    • Set SPARK_WORKER_INSTANCES=10, which determines the number of Worker instances (#Executors) per node (its default value is 1)
    • Set SPARK_WORKER_CORES=15, the number of cores that one Worker can use (default: all cores; in your case 36)
    • Set SPARK_WORKER_MEMORY=55g, the total amount of memory that can be used on one machine (Worker Node) for running Spark programs
  • Copy this configuration file to all Worker Nodes, into the same folder

  • Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...)

As you have 5 workers, with the above configuration you should see 5 (workers) * 10 (executors per worker) = 50 alive executors on the master's web interface (http://localhost:8080 by default)

When you run an application in standalone mode, by default it will acquire all available Executors in the cluster. You need to explicitly set the amount of resources for this application, e.g.:

val conf = new SparkConf()
             .setMaster(...)
             .setAppName(...)
             .set("spark.executor.memory", "2g")
             .set("spark.cores.max", "10")
answered by ngtrkhoa; edited by Jacek Laskowski
    This lets me run multiple _workers_ per physical node with 1 executor per worker. `SPARK-1706` seems to indicate that it should now be possible to run 1 worker per physical node with multiple executors. Do you know how to do this configuration? – Rich Apr 30 '15 at 15:41
  • Oh I see I have been misreading the timing of that PR. Thought it was merged last year, not this month. Looks like that may not be available until Spark 1.4 – Rich Apr 30 '15 at 17:06
  • @Rich Currently, 1 node has 1 Worker, which is responsible for launching and managing multiple Executors but isn't responsible for running our jobs itself. The term "worker" you used may refer to Executor, because people using Hadoop & Spark may use that term for the unit that runs the job. – ngtrkhoa Apr 30 '15 at 17:21
  • I most certainly am referring to Worker in the true sense of its definition in Spark which is the JVM responsible for coordinating executor(s) on a worker node in the cluster. The nodes do have multiple Workers in the context of the settings suggested in your answer, which also seems to be the only way to have multiple executors on a single node in Spark Standalone as of Spark 1.3.1. Each of the Worker processes will create 1 executor. I see this both when ./start-all.sh is executed and it creates more Worker processes and from the cluster UI which shows the same. – Rich Apr 30 '15 at 19:03
  • 1
    Thank you for your discussion, But I think there should be only one Worker per node. It's a separate process on Worker node. When this Worker receives a "launch N Executors" request, it will spawn N separate processes - which are N Executors. There is a similar question here, you can have a look, http://stackoverflow.com/questions/24696777/apache-spark-whats-the-relationship-between-worker-worker-instance-executor, answered by Sean Owen - committer of Spark. Ps: When you run the `start-all.sh` script, you will see that on each Worker Node, there is only one jvm process running - the Worker – ngtrkhoa Apr 30 '15 at 21:16
  • I know what the application UI, `start-all.sh` logs, and OS are telling me about how many Worker JVM processes are running. There are 10 Worker processes on 5 nodes. Which is fine, it's the current state of Spark that you cannot get multiple executors _per application_ on a single worker node without also running multiple worker processes per node. This is not contradictory with what Sean Owen was explaining in the other thread. – Rich May 04 '15 at 18:40
  • @Rich In spark 1.5, is it still the case that "you cannot get multiple executors per application on a single worker node without also running multiple worker processes per node" ? if we can, how to setup that? – qqibrow Oct 23 '15 at 05:16
  • @ngtrkhoa Doesn't the Spark environment configuration define the number of worker instances as Workers rather than Executors ("SPARK_WORKER_INSTANCES, to set the number of worker processes per node")? If I understand your first reply correctly, you are treating the number of worker instances as the number of Executors. Could you please explain this? I was also looking for a config setting for the number of executors. – hab Dec 01 '15 at 10:53
  • how do you set properties with configuration in spark 2.0 using SparkSession builder? – horatio1701d Aug 04 '16 at 19:32
  • 4
    This answer is mixing up executors and workers in multiple places. Needs to be fixed/clarified. – WestCoastProjects Dec 06 '16 at 17:17
  • 1
    Which one is it then? Multiple worker instances per node with 1 executor per worker, or 1 worker per node with multiple executors? The answer and the comments are different. Setting `SPARK_WORKER_INSTANCES = 10` does not imply hadoop workers... it means exactly spark workers... – vefthym Mar 08 '17 at 10:15
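
Regarding the Spark 2.0 / SparkSession question in the comments above: a minimal sketch of setting the same kind of properties through the SparkSession builder (Spark 2.x; the master URL, application name, and values are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master-host:7077")      // placeholder standalone master URL
  .appName("my-app")                       // placeholder application name
  .config("spark.executor.memory", "2g")   // memory per executor
  .config("spark.cores.max", "10")         // total cores the application may use
  .getOrCreate()

// the underlying SparkContext is available as spark.sparkContext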

In standalone mode, by default, all the resources in the cluster are acquired when you launch an application. You need to specify the resources you want using the --executor-cores and --total-executor-cores options; the number of executors follows from the two.

For example, if there is 1 worker in your cluster (1 worker == 1 machine; it is good practice to have only 1 worker per machine) with 3 cores and 3 GB available in its pool (this is specified in spark-env.sh), then when you submit an application with --executor-cores 1 --total-executor-cores 2 --executor-memory 1g, two executors are launched for the application, each with 1 core and 1 GB. Hope this helps!
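
The same limits can also be set programmatically rather than on the command line; a minimal sketch using the configuration properties that correspond to those flags (the master URL and application name are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .setAppName("my-app")                    // placeholder application name
  .set("spark.executor.cores", "1")        // same as --executor-cores 1
  .set("spark.cores.max", "2")             // same as --total-executor-cores 2
  .set("spark.executor.memory", "1g")      // same as --executor-memory 1g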

answered by void

Starting in Spark 1.4 it should be possible to configure this:

Setting: spark.executor.cores

Default: 1 in YARN mode, all the available cores on the worker in standalone mode.

Description: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.

http://spark.apache.org/docs/1.4.0/configuration.html#execution-behavior
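
A minimal sketch of how this setting could be combined with spark.cores.max to get more than one executor per worker in standalone mode on Spark 1.4+ (the master URL, application name, and concrete values are only examples for the 36-core workers in the question):

import org.apache.spark.{SparkConf, SparkContext}

// With spark.executor.cores set well below a worker's core count, the standalone
// scheduler can place more than one executor of this application on the same
// worker, up to the spark.cores.max total.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .setAppName("my-app")                    // placeholder application name
  .set("spark.executor.cores", "5")        // cores per executor
  .set("spark.executor.memory", "8g")      // memory per executor
  .set("spark.cores.max", "50")            // total cores for this application

val sc = new SparkContext(conf)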

answered by Rich

As of Apache Spark 2.2, Standalone Cluster Mode Deployment still does not resolve the issue of the number of EXECUTORS per WORKER, but there is an alternative: launch Spark executors manually:

[usr@lcl ~spark/bin]# ./spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend \
    --driver-url spark://CoarseGrainedScheduler@DRIVER-URL:PORT \
    --executor-id val \
    --hostname localhost-val \
    --cores 41 \
    --app-id app-20170914105902-0000-just-exemple \
    --worker-url spark://Worker@localhost-exemple:34117

I hope that helps!

answered by Yugerten