
I am using Hadoop 1.0.3 on a cluster of 10 desktop machines, each running 32-bit Ubuntu 12.04 LTS. The JDK is 7u75. Each machine has 2 GB of RAM and a Core 2 Duo processor.

For a research project, I need to run a Hadoop job similar to "Word Count", and I need to run it on a large dataset, at least 1 GB in size.

I am trying to use Hadoop's example jar, hadoop-examples-1.0.3.jar, to count the words of an input dataset. Unfortunately, I cannot run any experiment with more than 5-6 MB of input data.
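
For reference, the job is launched with the wordcount driver from that jar, along the lines of the following (the HDFS input and output paths are only placeholders):

bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/input /user/hduser/output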

For input I am using plain-text story books from https://www.gutenberg.org. I also used some RFCs from https://www.ietf.org. All the inputs are .txt files of English text.

My system produces correct output for a single .txt document. However, when there is more than one .txt file, it continuously gives the error:

INFO mapred.JobClient: Task Id :      attempt_XXXX, Status : FAILED
Too many fetch-failures

The dataset also works fine when I use a single-node cluster. I have found some solutions in previous Stack Overflow posts, for example this one and this one, and some more, but none of them worked in my case. Following their suggestions, I updated my /usr/local/hadoop/conf/mapred-site.xml file as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> 
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.9</value> 
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>90</value> 
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value> 
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>100</value> 
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value> 
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/user/localdir</value> 
</property>

</configuration>

In this file I have taken the values for the properties "mapred.local.dir", "mapred.map.tasks", and "mapred.reduce.tasks" from michael-noll's blog. Also, I have set

export HADOOP_HEAPSIZE=4000

in the conf/hadoop-env.sh file.
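
For reference, the relevant part of conf/hadoop-env.sh looks roughly like this (the JAVA_HOME path is only a placeholder for wherever the JDK 7u75 is installed; HADOOP_HEAPSIZE is specified in MB):

# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-oracle   # placeholder: point this at the installed JDK
export HADOOP_HEAPSIZE=4000                   # heap size for Hadoop daemons, in MB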

As I have already set up all 10 machines with hadoop-1.0.3, it would be most helpful if someone could give me a solution that does not require changing the Hadoop version.

I also want to mention that I am a Hadoop newbie. I have found many articles about Hadoop, but I could not settle on any of them as a standard reference for this topic. If anybody knows an informative and authoritative article about Hadoop, please feel free to share it with me.

Thanks everyone in advance.

1 Answer


My problem is now solved. The problem was actually in my network settings: because of the faulty settings, Hadoop could not locate the right machine during the reduce phase.

The correct network settings should be:

The /etc/hosts file should contain the following information:

127.0.0.1    localhost

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.x.x    master
192.168.x.y    slave1
....

And the /etc/hostname file should contain only the hostname that is written in the hosts file. For example, on the master machine the hostname file should contain just one word:

master

For the machine slave1, the file should contain:

slave1
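
After fixing these files, a quick way to verify the setup on every node is to check that the hostname and name resolution match the hosts file (hostname, getent and ping are standard Linux tools; the hostnames below are the example names above):

hostname              # should print the node's own name, e.g. master or slave1
getent hosts master   # should resolve to 192.168.x.x, not a 127.* address
getent hosts slave1   # should resolve to 192.168.x.y
ping -c 1 slave1      # should reach the slave over the LAN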