If you're unable to upgrade the cluster for whatever reason, you can try the following:
- Ensure that your hostname is bound to the network IP and NOT to 127.0.0.1 in /etc/hosts (see the example hosts entry after this list).
- Ensure that you're using only hostnames and not IPs to reference services.
- If the above are correct, try the following settings (a sketch of one way to apply them follows this list):
set mapred.reduce.slowstart.completed.maps=0.80
set tasktracker.http.threads=80
set mapred.reduce.parallel.copies=10 (any value >= 10; 10 should probably be sufficient)
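
For the first point, a hosts entry would look roughly like this (the hostname and address below are just placeholders for your own node):

    # /etc/hosts -- bind the node's hostname to its real network IP, not to 127.0.0.1
    127.0.0.1      localhost
    192.168.1.10   hadoop-node1.example.com   hadoop-node1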
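
And a minimal sketch of applying the three settings cluster-wide via mapred-site.xml (property names as in Hadoop 1.x / MRv1; tasktracker.http.threads is a daemon-side setting, so the TaskTrackers need a restart for it to take effect, while the other two can also be set per job):

    <!-- mapred-site.xml (MRv1 property names) -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <!-- fraction of maps that must finish before reducers start fetching -->
      <value>0.80</value>
    </property>
    <property>
      <name>tasktracker.http.threads</name>
      <!-- worker threads the tasktracker uses to serve map outputs -->
      <value>80</value>
    </property>
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <!-- parallel fetches per reducer; 10 or more -->
      <value>10</value>
    </property>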
Also check out this SO post: Why I am getting "Too many fetch-failures" every other day
And this one: Too many fetch failures: Hadoop on cluster (x2)
And, if the above don't help, this mailing-list thread: http://grokbase.com/t/hadoop/common-user/098k7y5t4n/how-to-deal-with-too-many-fetch-failures
For brevity, and in the interest of time, I'm putting what I found to be the most pertinent part here:
The number one cause of this is something that causes a connection made to fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e., the tasktracker attempting the fetch received an incorrect IP address when it looked up the name of the tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker being overloaded due to insufficient threads or a too-small listen backlog; this can happen if the number of fetches per reduce is large and the number of reduces or the number of maps is very large.
There are probably other cases; this recently happened to me when I had 6000 maps and 20 reducers on a 10-node cluster, which I believe was case 3 above. Since I didn't actually need to reduce (I got my summary data via counters in the map phase), I never re-tuned the cluster.
EDIT: The original answer said "Ensure that your hostname is bound to the network IP and 127.0.0.1 in /etc/hosts".