
I want to run spark-shell in YARN mode with a fixed number of executor cores: 100 in total (25 executors × 4 cores each).

The command I use is as follows:

spark-shell --num-executors 25 --executor-cores 4 --executor-memory 1G \
--driver-memory 1G --conf spark.yarn.executor.memoryOverhead=2048 --master yarn \
--conf spark.driver.maxResultSize=10G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-i input.scala

input.scala looks something like this:

import java.io.ByteArrayInputStream

// Plaintext sum on 10M rows
def aggrMapPlain(iter: Iterator[Long]): Iterator[Long] = {
  var res = 0L
  while (iter.hasNext) {
    val cur = iter.next
    res = res + cur
  }
  List[Long](res).iterator
}

val pathin_plain = <some file>

val rdd0 = sc.sequenceFile[Int, Long](pathin_plain)
val plain_table = rdd0.map(x => x._2).cache
plain_table.count

0 to 200 foreach { i =>
  println("Plain - 10M rows - Run " + i + ":")
  plain_table.mapPartitions(aggrMapPlain).reduce((x, y) => x + y)
}

On executing this, the Spark UI first spikes to about 40 cores and then settles at 26 cores.
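To check how many executors YARN has actually granted (rather than relying only on the UI), here is a quick sketch I can run from the spark-shell prompt (note that getExecutorMemoryStatus also counts the driver, hence the -1):

// Executors currently registered with the driver (the map includes the driver itself, hence -1)
val numExecutors = sc.getExecutorMemoryStatus.size - 1
// With --executor-cores 4, 25 executors would correspond to the 100 cores I am asking for
println(s"Executors: $numExecutors, cores requested: ${numExecutors * 4}")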

On the recommendation of this, I changed the following in my yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>101</value>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>101</value>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>102400</value>
</property>

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>

But I still cannot force Spark to use 100 cores, which I need because I am benchmarking against earlier tests.

I am using Apache Spark 1.6.1. Each node in the cluster, including the driver, has 16 cores and 112 GB of memory. The cluster runs on Azure (HDInsight), with 2 driver nodes and 7 worker nodes, so the workers alone provide 7 × 16 = 112 cores, more than the 100 I am requesting.

1 Answer


I'm unfamiliar with Azure, but I guess YARN is YARN, so you should make sure that you have

yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

in capacity-scheduler.xml.
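In XML form the entry would look roughly like this (a sketch showing only this one property; the rest of capacity-scheduler.xml stays as it is):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

With the default DefaultResourceCalculator, the capacity scheduler sizes containers by memory only, so each executor container is accounted as a single vcore regardless of --executor-cores, which would explain the low core count you are seeing.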

(See this similar question and answer)

  • Thank you, I have made that edit to my `capacity-scheduler.xml` but still no change. Currently, to get around this I am setting `--num-executors` to 100 which gives me 101 cores in the Spark UI. – Sood Sep 30 '16 at 06:11
  • After making some edits here and there and rewriting some of my code, I have managed to solve the issue. I changed so many things, I'm not sure which one worked but this definitely helped. – Sood Oct 03 '16 at 07:08
  • @Sood I realize this is a very old thread, but would you mind sharing your final configs? – maverik Apr 20 '18 at 21:18
  • @maverik sorry, but I no longer have access to the codebase. – Sood Apr 22 '18 at 05:07