I have an 8-node Hadoop cluster, where each node has 24 physical cores with Hyper-Threading enabled (48 vCPUs) and 256 GB of memory.
I am trying to run a 6TB Terasort job.
Problem: Terasort runs with no errors when I use `yarn.nodemanager.resource.cpu-vcores=44` (48 minus 4 reserved for the OS, DN, RM, etc.). However, when I try to over-subscribe the CPUs with `yarn.nodemanager.resource.cpu-vcores=88`, I get several map and reduce errors.
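For reference, this is the property I am changing in `yarn-site.xml` on each NodeManager (the property name and values are exactly the two configurations I tried; everything else is left at our cluster defaults):

```xml
<!-- yarn-site.xml on each NodeManager -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <!-- 44 runs cleanly; 88 (2x over-subscription of the 48 vCPUs)
       produces the map/reduce failures described below -->
  <value>88</value>
</property>
```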
All map failures look like "Too many fetch failures....". All reduce errors look like "....#Block does not have enough number of replicas....".
I have seen THIS and THIS links. I have checked my /etc/hosts files and have also bumped the net.core.somaxconn kernel parameter.
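In case it matters, this is how I checked and raised `net.core.somaxconn` (the value 16384 is just what I picked; the file name under `/etc/sysctl.d/` is my own choice):

```shell
# Check the current value (Linux default is often 128 or 4096)
sysctl -n net.core.somaxconn

# Raise it persistently on every node, then reload
echo 'net.core.somaxconn = 16384' | sudo tee /etc/sysctl.d/99-hadoop.conf
sudo sysctl --system
```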
I don't understand why I get map and reduce failures only when the CPUs are over-subscribed.
Any hints or recommendations would be appreciated; thanks in advance.