I am using Hadoop 1.0.3 on a cluster of 10 desktop machines, each running 32-bit Ubuntu 12.04 LTS with JDK 7u75. Each machine has 2 GB of RAM and a Core 2 Duo processor.
For a research project, I need to run a Hadoop job similar to "Word Count" on a large dataset, at least 1 GB in size.
I am using Hadoop's bundled example jar, hadoop-examples-1.0.3.jar, to count the words of an input dataset. Unfortunately, I cannot run any experiment with more than 5-6 MB of input data.
For input I am using plain-text books from https://www.gutenberg.org and some RFCs from https://www.ietf.org. All inputs are English text in .txt format.
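For reference, this is roughly how I load the input into HDFS and launch the job (the HDFS directory names are just placeholders from my setup):

# copy the .txt inputs into HDFS (directory names are placeholders)
hadoop fs -mkdir /user/hduser/gutenberg
hadoop fs -put ~/datasets/*.txt /user/hduser/gutenberg

# run the stock word-count example from the bundled jar
hadoop jar hadoop-examples-1.0.3.jar wordcount \
    /user/hduser/gutenberg /user/hduser/gutenberg-output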
The cluster produces correct output for a single .txt document. However, as soon as the input contains more than one .txt file, it starts repeatedly reporting the error:
INFO mapred.JobClient: Task Id : attempt_XXXX, Status : FAILED
Too many fetch-failures
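From what I understand, "Too many fetch-failures" means the reduce tasks cannot fetch map output over HTTP from the TaskTrackers. When it happens I inspect the TaskTracker logs on the slave nodes (assuming the default $HADOOP_HOME/logs layout):

# on each slave, check the most recent TaskTracker log for shuffle errors
tail -n 100 $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log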
The same dataset also works fine when I use a single-node cluster. I have found some suggested solutions in previous Stack Overflow posts, for example this one and this one, among others, but none of them worked in my case. Following their suggestions, I have updated my /usr/local/hadoop/conf/mapred-site.xml file as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
    </description>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.9</value>
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>90</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>100</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/user/localdir</value>
  </property>
</configuration>
The values for the properties mapred.local.dir, mapred.map.tasks, and mapred.reduce.tasks are taken from Michael Noll's blog. In addition, in conf/hadoop-env.sh I have set:

export HADOOP_HEAPSIZE=4000
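After changing these files, I restart the daemons from the master so the new settings take effect (the standard Hadoop 1.x scripts, assuming the stock bin/ layout):

# run on the master node
$HADOOP_HOME/bin/stop-all.sh
$HADOOP_HOME/bin/start-all.sh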
Since all 10 machines are already set up with hadoop-1.0.3, it would be most helpful if someone could suggest a solution that does not require changing the Hadoop version.
I should also mention that I am a newbie with Hadoop. I have found many articles about it, but I could not settle on any of them as an authoritative reference on this topic. If anybody knows an informative and reliable article on Hadoop, please feel free to share it.
Thanks everyone in advance.