
I have installed Spark on a master and 2 workers. Each worker has 8 cores. When I start the master, the workers come up without any problem, but in the Spark GUI each worker has only 2 cores assigned.

How can I increase the number of cores so that each worker runs with all 8 cores?

Jamie

1 Answer


The setting that controls cores per executor is spark.executor.cores. See the doc. It can be set either as a spark-submit command-line argument or in spark-defaults.conf. The file is usually located in /etc/spark/conf (ymmv). You can search for the conf file with find / -type f -name spark-defaults.conf

spark.executor.cores 8
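
For example, a minimal PySpark sketch of setting the option at application start (the application name and script name are hypothetical; the same option can also be passed to spark-submit as --conf spark.executor.cores=8):

from pyspark.sql import SparkSession

# Equivalent command-line form (hypothetical script name):
#   spark-submit --conf spark.executor.cores=8 my_app.py
spark = (
    SparkSession.builder
    .appName("cores-example")             # hypothetical app name
    .config("spark.executor.cores", "8")  # cores requested per executor
    .getOrCreate()
)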

However, this setting does not guarantee that each executor will always get all the available cores. That depends on your workload.

If you schedule tasks on a DataFrame or RDD, Spark will run a parallel task for each partition. A task is scheduled onto an executor (a separate JVM), and the executor can run multiple tasks in parallel, in JVM threads, one per core.
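
As a small illustration of that relationship, a sketch assuming a running SparkSession named spark:

# Each partition of a DataFrame becomes one task when an action runs.
df = spark.range(1000)             # toy DataFrame
print(df.rdd.getNumPartitions())   # number of partitions = number of parallel tasks

df = df.repartition(14)            # ask for 14 partitions -> up to 14 concurrent tasks
print(df.rdd.getNumPartitions())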

Also, an executor will not necessarily run on a separate worker. If there is enough memory, 2 executors can share a worker node.

In order to use all the cores, the setup in your case could look as follows, given you have 10 GB of memory on each node:

spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g

Setting memory to 9g makes sure each executor is assigned to a separate node. Each executor will then have 7 cores available, and each DataFrame operation will be scheduled to 14 concurrent tasks, distributed 7 per executor. You can also repartition a DataFrame instead of setting spark.default.parallelism. One core and 1 GB of memory are left for the operating system on each node.
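
To confirm at runtime that these settings were picked up, and as a sketch of the repartition alternative (assuming a running SparkSession named spark and an already loaded DataFrame df):

# Print the effective executor/parallelism settings of the running application.
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.executor.") or key == "spark.default.parallelism":
        print(key, value)

# Alternative to spark.default.parallelism: set the partition count directly
# on the DataFrame you are working with.
df = df.repartition(14)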

dre-hh
  • First, I would like to thank you for your answer. Actually I have only 10 GB of memory on each of the master, slave1, and slave2. Can I use your instructions in this case, and where should I write these: spark.default.parallelism 16, spark.executor.cores 8, spark.executor.memory 15g? – Jamie Dec 05 '19 at 12:18
  • In your case it is `9g`, as you have only 10g of memory and you want to leave some memory for the machine itself. Actually there are more parameters for memory overhead, but try this first. On the master node there must be a `spark-defaults.conf` file. On AWS the file is located at `/etc/spark/conf/spark-defaults.conf`, but it does not have to be there, as this depends on your installation. Search for the file on your machine with `find / -type f -name spark-defaults.conf` – dre-hh Dec 05 '19 at 12:48
  • It is also good to leave a core on each node for the OS and scheduling processes, so set the number of cores to 7. – dre-hh Dec 05 '19 at 12:53
  • Dear dre-hh, I found spark-defaults.conf and pasted the three rows above as-is, but only on the master node, and it is still not working and giving the same result. What do you think the problem is? Should I also write it on the slave nodes, or should I write it with an equals (=) sign, like spark.default.parallelism=14? – Jamie Dec 05 '19 at 18:24
  • No, this is not necessary. I just checked our config. I forgot one option, `spark.executor.instances 2`, and added it to the answer. It can also be that your workload is just not big enough for such parallelism. Are you loading a DataFrame? Check its number of partitions. In Python: `df.rdd.getNumPartitions()`, see https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=partitions#pyspark.RDD.getNumPartitions – dre-hh Dec 05 '19 at 22:15
  • @AhmedM.Jamel btw, spark-defaults.conf is usually full of settings. You wrote that you pasted the lines above. There may already be a setting further below, which would then overwrite what you entered above. – dre-hh Dec 06 '19 at 10:16
  • Dear dre-hh, I pasted all the settings and ran the master with this command: ./bin/spark-class org.apache.spark.deploy.master.Master, but it still doesn't give me all the available cores. Is there another option I can try? – Jamie Dec 09 '19 at 12:01
  • Try using `spark-submit` https://spark.apache.org/docs/2.3.0/submitting-applications.html#launching-applications-with-spark-submit instead of `./bin/spark-class`. spark-submit accepts conf options via the command line or a config file; by default it reads the conf/spark-defaults.conf file. Check at runtime whether those settings were loaded correctly with `spark.sparkContext.getConf().getAll()` – dre-hh Dec 09 '19 at 12:41