
I have a 9-node Hadoop cluster running MapR, with the following Yarn roles:

  • 1 node is a Yarn ResourceManager
  • 8 of the nodes are Yarn NodeManagers (two of which are also backup ResourceManagers)

Whenever I submit a Spark job via pyspark I see something like the following:

yarn node -list
17/04/26 23:40:26 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to hadoop-n1/XXX.XXX.XX.XX:8032
Total Nodes:8
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
 hadoop-d8:52118                RUNNING    hadoop-d8:8042                                  4
 hadoop-d5:39108                RUNNING    hadoop-d5:8042                                  0
 hadoop-d2:42471                RUNNING    hadoop-d2:8042                                  0
 hadoop-d4:36191                RUNNING    hadoop-d4:8042                                  0
 hadoop-d3:53476                RUNNING    hadoop-d3:8042                                  0
 hadoop-d6:52497                RUNNING    hadoop-d6:8042                                  4
 hadoop-d1:59887                RUNNING    hadoop-d1:8042                                  0
 hadoop-d7:51878                RUNNING    hadoop-d7:8042                                  4

So d6, d7 and d8 are always used (and always with 4 containers each), and the other nodes are never used. I've set spark.dynamicAllocation.enabled to True, which made no difference. I've also tried restarting all the Yarn processes, again with no luck. How can I get my jobs to run on all nodes?
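For reference, here is roughly how I have dynamic allocation configured in spark-defaults.conf (a minimal sketch; the min/max executor values below are illustrative rather than my exact settings):

# minimal sketch of dynamic allocation settings; min/max values are illustrative
spark.dynamicAllocation.enabled       true
# the external shuffle service is required for dynamic allocation on Yarn
spark.shuffle.service.enabled         true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  32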

I am running Hadoop version 2.7.0, Spark version 2.0.1 and MapR version 5.2 on Ubuntu 14.04. The Yarn resource calculator is set to DiskBasedResourceCalculator, which takes memory, CPU and disk into account.
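For reference, the calculator is selected via the standard property in capacity-scheduler.xml (a minimal sketch; I am not certain of the fully qualified class name of MapR's disk-based calculator, so the value shown is indicative only):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <!-- indicative only: the exact class name of MapR's DiskBasedResourceCalculator may differ -->
  <value>org.apache.hadoop.yarn.util.resource.DiskBasedResourceCalculator</value>
</property>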

EDIT: Based on the answer linked in the comments I changed to the DominantResourceCalculator and set spark.executor.instances to 0. Still no change. If I look at one of the busy nodes I see:

% yarn node -status hadoop-d7:35731
17/04/27 16:01:19 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to hadoop-n1/
Node Report : 
    Node-Id : hadoop-d7:35731
    Rack : /default-rack
    Node-State : RUNNING
    Node-Http-Address : hadoop-d7:8042
    Last-Health-Update : Thu 27/Apr/17 03:59:29:475EDT
    Health-Report : 
    Containers : 4
    Memory-Used : 32768MB
    Memory-Capacity : 97988MB
    CPU-Used : 20 vcores
    CPU-Capacity : 20 vcores
    Node-Labels : 

whereas one of the unused nodes shows

% yarn node -status hadoop-d2:37314
17/04/27 16:01:05 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to hadoop-n1/
Node Report : 
    Node-Id : hadoop-d2:37314
    Rack : /default-rack
    Node-State : RUNNING
    Node-Http-Address : hadoop-d2:8042
    Last-Health-Update : Thu 27/Apr/17 03:59:32:255EDT
    Health-Report : 
    Containers : 0
    Memory-Used : 0MB
    Memory-Capacity : 70189MB
    CPU-Used : 0 vcores
    CPU-Capacity : 4 vcores
    Node-Labels : 
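The unused node above advertises only 4 vcores while the busy one advertises 20, so the capacities themselves clearly differ across nodes. For reference, these capacities are normally what each NodeManager advertises to the ResourceManager from its yarn-site.xml (a minimal sketch with illustrative values, not copied from my nodes; I believe MapR's warden can also manage them):

<property>
  <!-- illustrative value, not my actual setting -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>98304</value>
</property>
<property>
  <!-- illustrative value, not my actual setting -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>20</value>
</property>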
  • check https://stackoverflow.com/questions/33940884/why-does-yarn-on-emr-not-allocate-all-nodes-to-running-spark-jobs?rq=1 – David Apr 27 '17 at 14:58
  • @David Thanks, I updated my question based on the answer in your link. – Sal Apr 27 '17 at 20:05
  • Could you explain a bit about the size and type of data being processed. Especially number of partitions in your RDDs. – code May 02 '17 at 08:42

0 Answers