If you're unable to upgrade the cluster for whatever reason, you can try the following:
- Ensure that your hostname is bound to the network IP and NOT to 127.0.0.1 in /etc/hosts (see the example hosts entry after this list).
- Ensure that you're using only hostnames and not IPs to reference services.
- If the above are correct, try the following settings (a sketch of one way to apply them follows this list):
set mapred.reduce.slowstart.completed.maps=0.80
set tasktracker.http.threads=80
set mapred.reduce.parallel.copies=10 (any value >= 10; 10 should probably be sufficient)
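
For the first point, a hosts entry would look roughly like this (the hostname and address below are just placeholders for your own node):

    # /etc/hosts -- bind the node's hostname to its real network IP, not to 127.0.0.1
    127.0.0.1      localhost
    192.168.1.10   hadoop-node1.example.com   hadoop-node1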
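
And a minimal sketch of applying the three settings cluster-wide via mapred-site.xml (property names as in Hadoop 1.x / MRv1; tasktracker.http.threads is a daemon-side setting, so the TaskTrackers need a restart for it to take effect, while the other two can also be set per job):

    <!-- mapred-site.xml (MRv1 property names) -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <!-- fraction of maps that must finish before reducers start fetching -->
      <value>0.80</value>
    </property>
    <property>
      <name>tasktracker.http.threads</name>
      <!-- worker threads the tasktracker uses to serve map outputs -->
      <value>80</value>
    </property>
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <!-- parallel fetches per reducer; 10 or more -->
      <value>10</value>
    </property>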
Also check out this SO post: Why I am getting "Too many fetch-failures" every other day
And this one: Too many fetch failures: Hadoop on cluster (x2)
And, if the above don't help, this mailing-list thread: http://grokbase.com/t/hadoop/common-user/098k7y5t4n/how-to-deal-with-too-many-fetch-failures
For brevity, and in the interest of time, I'm putting what I found to be the most pertinent part here:
The number one cause of this is something that causes a connection made to fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e., the tasktracker attempting the fetch received an incorrect IP address when it looked up the name of the tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker being overloaded due to insufficient threads or a too-small listen backlog; this can happen if the number of fetches per reduce is large and the number of reduces or the number of maps is very large.
There are probably other cases; this recently happened to me when I had 6000 maps and 20 reducers on a 10-node cluster, which I believe was case 3 above. Since I didn't actually need to reduce (I got my summary data via counters in the map phase), I never re-tuned the cluster.
EDIT: The original answer said "Ensure that your hostname is bound to the network IP and 127.0.0.1 in /etc/hosts".