I want to run spark-shell in YARN mode with a certain number of cores.
The command I use is as follows:
spark-shell --num-executors 25 --executor-cores 4 --executor-memory 1G \
--driver-memory 1G --conf spark.yarn.executor.memoryOverhead=2048 --master yarn \
--conf spark.driver.maxResultSize=10G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-i input.scala
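For reference, this is the resource arithmetic I expect those flags to imply (a rough sketch; the variable names are mine, and I am assuming YARN sizes each executor container as executor memory plus spark.yarn.executor.memoryOverhead):

// Rough arithmetic for what the spark-shell flags above request.
// Assumption: container size = executor memory + memoryOverhead.
val numExecutors       = 25
val coresPerExecutor   = 4
val execMemoryMB       = 1024                                     // --executor-memory 1G
val overheadMB         = 2048                                     // spark.yarn.executor.memoryOverhead
val totalExecutorCores = numExecutors * coresPerExecutor          // 100 executor cores in total
val containerMB        = execMemoryMB + overheadMB                // 3072 MB per executor container
val totalExecutorMB    = numExecutors * containerMB               // 76800 MB across the cluster
println(s"cores=$totalExecutorCores, containerMB=$containerMB, totalMB=$totalExecutorMB")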
input.scala looks something like this:
import java.io.ByteArrayInputStream

// Plaintext sum over one partition (10M rows in total)
def aggrMapPlain(iter: Iterator[Long]): Iterator[Long] = {
  var res = 0L
  while (iter.hasNext) {
    val cur = iter.next()
    res = res + cur
  }
  List[Long](res).iterator
}

val pathin_plain = <some file>

val rdd0 = sc.sequenceFile[Int, Long](pathin_plain)
val plain_table = rdd0.map(x => x._2).cache
plain_table.count   // materialise the cached RDD

// Run the aggregation repeatedly for benchmarking
0 to 200 foreach { i =>
  println("Plain - 10M rows - Run " + i + ":")
  plain_table.mapPartitions(aggrMapPlain).reduce((x, y) => x + y)
}
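As a cross-check against the UI, I can also query the context from the same shell (a rough sketch; sc.getExecutorMemoryStatus lists the driver as well as the executors, so I subtract one):

// Rough check of what the application actually got (Spark 1.6)
val registeredExecutors = sc.getExecutorMemoryStatus.size - 1            // minus the driver entry
val coresPerExecutor    = sc.getConf.getInt("spark.executor.cores", 1)
println(s"executors=$registeredExecutors, total executor cores=${registeredExecutors * coresPerExecutor}")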
On executing this, the Spark UI first spikes to about 40 cores and then settles at 26 cores.
Following this recommendation, I changed the following in my yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>101</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>102400</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>
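My rough reading of how these limits interact with the executor size above (my own arithmetic, again assuming each executor container asks YARN for 3072 MB and 4 vcores):

// Per-node capacity under the yarn-site.xml values above
val nodeMemMB       = 102400          // yarn.nodemanager.resource.memory-mb
val nodeVcores      = 101             // yarn.nodemanager.resource.cpu-vcores
val containerMB     = 1024 + 2048     // executor memory + memoryOverhead
val containerVcores = 4               // --executor-cores
val byMemory = nodeMemMB / containerMB        // 33 executors per node if memory is the limit
val byVcores = nodeVcores / containerVcores   // 25 executors per node if vcores are the limit
println(s"per-node executor limit: by memory=$byMemory, by vcores=$byVcores")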
But I still cannot get Spark to use 100 cores, which I need because I am benchmarking against earlier tests.
I am using Apache Spark 1.6.1. Each node in the cluster, including the driver, has 16 cores and 112 GB of memory. The nodes are on Azure (an HDInsight cluster): 2 driver nodes + 7 worker nodes.