Too many fetch failures when running Terasort

Question

I have a 8 node Hadoop cluster, where each node has 24 physical cores with Hyper-threading, thus, 48 vCPUs and 256GB memory.

I am trying to run a 6TB Terasort job.

Problem: Terasort runs with no errors when I use yarn.nodemanager.resource.cpu-vcores=44 (48 minus 4 for OS, DN, RM, etc.). However, when I try to over-subscribe the CPUs with yarn.nodemanager.resource.cpu-vcores=88, I get several map and reduce errors.

All map failures are like "Too many fetch failures....". All reduce erros are like "....#Block does not have enough number of replicas....".

I have seen THIS and THIS links. I have checked my /etc/hosts files and also bumped my net.core.somaxconn kernel parameter.

I don't understand why do I get map and reduce failures with over-subscribed CPUs.

Any hints or recommendations would be helpful, and thanks in advance.

score 0 · Answer 1 · answered May 18 '18 at 14:21

I got to the bottom of the “Too many fetch…” error. What was happening was that because the servers were heavily loaded when running my 7TB job (remember that 1TB jobs always ran successfully), there were not enough connections happening between the master and the slaves. I needed to increase the listen queue between the master and the slave, which can be done by modifying a kernel parameter called “somaxconn”.

By default, “somaxconn” is set to to 128 in the rhel OS. By bumping it to 1024, the 7TB terasort job ran successfully with no failures.

Hope this helps someone.

Too many fetch failures when running Terasort

1 Answers1