1

I am testing the MapReduce wordcount example on Amazon EC2 m1.small instances. I have followed the Amazon command line getting started guide.

bin/ec2hadoop launch-cluster test 2

Using this command I get 2 slave nodes (3 instances running in total). Then I can log in to the master node to run the Hadoop program (which is bundled into a jar file). It took 35 minutes.

For a scalability experiment, I then ran the same program using 4 instances:

bin/ec2hadoop launch-cluster test 4

To my surprise, I did not see any gain in performance: the MapReduce application took almost the same amount of time (33 minutes).

Where can the problem lie? Is this acceptable behaviour?

In mapred-site.xml:
`mapred.tasktracker.map.tasks.maximum` is set to 1
`mapred.tasktracker.reduce.tasks.maximum` is set to 1
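(For reference, a minimal excerpt showing roughly how those two settings look inside mapred-site.xml, with the values as stated above.)

    <!-- mapred-site.xml (excerpt): at most one map task and one reduce task
         run concurrently on each TaskTracker -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>1</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>
    </configuration>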

Any suggestions are welcome.

Trojosh
  • How many input files are there, and of what size? Have you examined the distribution of tasks to the machines in the TaskTracker web UI? Did map tasks run on every node or just on a single one? – harpun Mar 02 '13 at 13:16
  • There are 10 input files (text data) read from s3n. I have not checked the distribution of tasks to machines; I don't know how to check that in the TaskTracker web UI, so I will search for how to do it. Also I assume (which may be wrong) that the Hadoop framework uses all slaves for *map* tasks, but I am not sure whether map tasks run on every node or on a single one. – Trojosh Mar 02 '13 at 14:12
  • Sorry, I meant the JobTracker web UI, usually at `http://<jobtracker>:50030/`. From the UI you can check various details about your completed job. See [Monitoring Hadoop](http://docs.hortonworks.com/CURRENT/index.htm#Monitoring_HDP/Understanding_Monitoring_For_Hadoop/Key_Information_Resources_For_Monitoring_Hadoop.htm) for some pointers. – harpun Mar 02 '13 at 15:41
  • Yes, I see it now. The thing is that the map work is done by all slave instances, but the reduce is done by only one of the slave nodes, and it seems that the reduce time dominates the overall performance. (With 4 slaves I can see 4 map tasks but only 1 reduce task, which runs on one slave while all the other slaves are idle.) – Trojosh Mar 02 '13 at 15:52

2 Answers

0

Based on your configuration, you have a maximum of 1 map task and 1 reduce task per node. Depending on the type of job you're running, it might be useful to set these parameters to the number of cores on the node, especially when the map/reduce tasks are computationally expensive.

If, as you said in your comment, the reduce phase is dominating the overall performance of the job, you should focus on this part. In Hadoop the number of reduce tasks for a job is specified in the job's configuration, and it directly determines the number of output files produced.

Having a single reduce task gives you a single output file. Having N reduce tasks results in N output files, each containing data sorted by key. Additionally, each reduce task is guaranteed to receive all the data for a given key from the map tasks.

In short: increasing the number of reduce tasks should improve the overall performance of the job, but it produces multiple output files. Depending on your needs, these files may have to be merged and sorted by key, either in a second MapReduce job or outside Hadoop, to end up with the same single output file you would get from a single reduce task.
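As a concrete illustration of setting the number of reduce tasks in the job configuration, here is a minimal word-count driver using the old `mapred` API. The class name `WordCountDriver` and the use of Hadoop's library `TokenCountMapper`/`LongSumReducer` are assumptions made for a self-contained sketch; the asker's actual mapper and reducer classes may differ.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Library word-count classes: the mapper emits (token, 1),
        // the reducer (and combiner) sums the counts per token.
        conf.setMapperClass(TokenCountMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // The point made above: ask for one reduce task per slave node
        // instead of relying on the default of a single reducer.
        conf.setNumReduceTasks(4);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

With 4 reduce tasks the job writes four part-* files into the output directory, which is why the paragraph above mentions merging them if a single sorted file is needed.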

harpun
  • I am using the `-D mapred.reduce.tasks=4` option now to run the job, but still get the same result. Do I need to set it using jobConf.setNumReduceTasks(), as said here: http://stackoverflow.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks? – Trojosh Mar 02 '13 at 16:26
  • @Trojosh: use `-Dmapred.reduce.tasks=4` (without the space). Additionally, it depends on the kind of job you're running, as it may override or ignore this parameter. If you are running your own word-count source code, the best way is to call jobConf.setNumReduceTasks(), just like you said. Furthermore, if you want to be able to pass the command line parameter to your own application, see http://stackoverflow.com/questions/11722424/hadoop-reduce-task-running-even-after-telling-on-command-line-as-d-mapred-reduc (a sketch of such a Tool-based driver follows these comments). – harpun Mar 02 '13 at 16:43
  • I am getting an OutOfMemoryError when I set reduce.tasks=4. Maybe it is because I have dfs.replication=2 (does that mean 2 data nodes?) and at most 1 map and 1 reduce task per node, so this value should be between (0.95 * 1 * 2) and (1.75 * 1 * 2), and 4 is more than both. – Trojosh Mar 02 '13 at 16:57
  • It is not related to DFS replication. You probably set the number of reducers per TaskTracker to 4, and that is something you cannot do on a small instance. – David Gruzman Mar 02 '13 at 18:18
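As referenced in harpun's comment above, for the `-Dmapred.reduce.tasks=4` generic option to take effect the driver has to be run through `ToolRunner`, so the option is parsed and placed into the job's configuration. Below is a hedged sketch of such a driver; the class name `WordCountTool` and the library mapper/reducer choices are illustrative, not taken from the original post.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountTool extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner,
        // including mapred.reduce.tasks if it was given on the command line.
        JobConf conf = new JobConf(getConf(), WordCountTool.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(TokenCountMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, ...) before
        // handing the remaining arguments to run().
        System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
      }
    }

It would then be invoked roughly as `hadoop jar wordcount.jar WordCountTool -Dmapred.reduce.tasks=4 <input> <output>` (the jar name is a placeholder).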
0

First of all, if it is properly configured, and the number of reducers grows as the cluster grows, Hadoop should show linear scalability.
I think the root cause of the results you get is the single reducer. When the results of all mappers are passed to a single reducer, it limits any performance gain from the cluster size. If you set the number of reducers to 4 (the number of nodes in the cluster) you should see the gain.
In addition, I have some doubts about Hadoop operating efficiently on small instances. The memory is near its limit, swapping can start and kill performance, and such an instance also gets only a small fraction of the 1 Gb Ethernet bandwidth, which can be another limiting factor.
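To make the "fine tuning for a small instance" mentioned in the comments below more concrete, the following is a hedged sketch of the kind of job-level memory knobs involved. The parameter names are Hadoop 1.x names; the class name and the specific values are illustrative examples only, not recommendations from this answer.

    import org.apache.hadoop.mapred.JobConf;

    public class SmallInstanceTuning {
      // Hypothetical helper: applies memory-related settings to a job
      // configuration before the job is submitted.
      public static JobConf tune(JobConf conf) {
        // Heap for each map/reduce child JVM; on an m1.small (~1.7 GB RAM)
        // this must leave room for the DataNode and TaskTracker daemons.
        conf.set("mapred.child.java.opts", "-Xmx200m");

        // Map-side sort buffer in MB; keep it well below the child heap,
        // otherwise the sort-and-spill phase can run out of memory.
        conf.setInt("io.sort.mb", 50);

        // Fraction of the reducer heap used for buffering map output
        // during the shuffle phase.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.5f);

        // Reducers for this job (a per-job setting, unlike the
        // per-TaskTracker maximum in mapred-site.xml).
        conf.setNumReduceTasks(4);
        return conf;
      }
    }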

David Gruzman
  • Where should I set the number of reducers? (1) In mapred-site.xml via mapred.tasktracker.reduce.tasks.maximum, (2) with the command line option -Dmapred.reduce.tasks=2, or (3) from the program via jobConf.setNumReduceTasks(n)? I have tried (3), which gives me a Java heap space error, and (2) does not change the number of reducers. Can you please explain a bit about what exactly I have to configure? – Trojosh Mar 02 '13 at 17:59
  • (2) and (3) are right, but (2) requires implementing the special Tool interface (http://stackoverflow.com/questions/2115292/run-hadoop-job-without-using-jobconf). jobConf.setNumReduceTasks(n) should work for sure. The out-of-memory error should not discourage you. Set the number of reducers to 4 in option (3) and to 1 in (1). – David Gruzman Mar 02 '13 at 18:15
  • The out-of-memory error can be solved by fine tuning for the small instance, but I would suggest using a bigger instance. – David Gruzman Mar 02 '13 at 18:16
  • I am not getting what you mean by *fine tuning for small instance*. – Trojosh Mar 02 '13 at 18:40
  • You can control what part of the memory is devoted to each reducer, and how it is divided between shuffling and the reducer itself. – David Gruzman Mar 02 '13 at 18:48
  • Sorry to report that I am getting the same *Out of memory* error even on m1.medium and m1.large instances. I also tried increasing the child.java.opts value to 512, but no luck. Thus I just cannot have more than one reducer running. When I setNumReduceTasks() to 4 from the program, how come my map tasks fail with an out-of-memory error? – Trojosh Mar 02 '13 at 19:06
  • It is strange. Can you post stack trace of the exception? – David Gruzman Mar 02 '13 at 19:26
  • Task Id : attempt_201303021844_0001_m_000003_0, Status : FAILED Error: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$InMemValBytes.reset(MapTask.java:1566) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getVBytesForOffset(MapTask.java:1549) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1416) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344) – Trojosh Mar 02 '13 at 19:35
  • This happens somewhere around `INFO mapred.JobClient: map 46% reduce 0%`. – Trojosh Mar 02 '13 at 19:36
  • It is one of two things: either your cluster configuration is seriously wrong, or your map function accumulates data. – David Gruzman Mar 02 '13 at 19:38
  • This happens only when I set the number of reduce tasks to 2 or 4. If I don't set it through the program, everything runs just fine, but then my reduce phase runs very slowly, like `map 100% reduce 16% map 100% reduce 33% map 100% reduce 66% map 100% reduce 67% map 100% reduce 68% ...`, creeping up by one percent until 100%. That is why I wanted to try increasing the number of reducers. – Trojosh Mar 02 '13 at 19:49
  • What exactly did you set: the number of reducers per TaskTracker, or the number of reducers per job? – David Gruzman Mar 02 '13 at 20:40
  • What Hadoop configuration are you using? Is it EMR? – David Gruzman Mar 02 '13 at 21:02
  • The problem was indeed my mapper function, which was accumulating lots of data. `conf.setNumReduceTasks(4)` just worked fine then. Thanks a lot! – Trojosh Mar 03 '13 at 01:12
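The final comment points to the actual culprit: a map function that accumulated data in memory. For comparison, the canonical word-count mapper emits a (word, 1) pair for every token and keeps nothing between calls, so its memory footprint stays flat regardless of the input split size. A minimal sketch with the old `mapred` API follows (the class name is illustrative, not from the original post):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NonAccumulatingWordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // Emit each token immediately instead of collecting counts in a
        // structure held across map() calls; the combiner/reducer does the
        // summing, so mapper memory use does not grow with the input.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          output.collect(word, ONE);
        }
      }
    }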