
I'm using the mongo-hadoop adapter to run map/reduce jobs. Everything works fine except for the job launch time and overall run time: even when the dataset is very small, the map phase takes 13 seconds and the reduce phase 12 seconds. I have changed settings in mapred-site.xml and core-site.xml, but the time taken for map/reduce seems to be constant. Is there any way I can reduce it? I also explored the optimized Hadoop distribution from Hanborq, which uses a worker pool for faster job launch/setup. Is there an equivalent available elsewhere? The Hanborq distribution is not very active: it was last updated 4 months ago and is built on an older version of Hadoop.

Some of my settings are as follows.

mapred-site.xml:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xms1g</value>
</property>
<property>
    <name>mapred.sort.avoidance</name>
    <value>true</value>
</property>
<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
</property>
<property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
</property>
<property>
    <name>mapred.compress.map.output</name>
    <value>false</value>
</property>

core-site.xml:

<property>
    <name>io.sort.mb</name>
    <value>300</value>
</property>
<property>
    <name>io.sort.factor</name>
    <value>100</value>
</property>

Any help would be greatly appreciated. Thanks in advance.

  • Why are you not using MongoDB's internal map/reduce? And Hadoop is really not usable for such real-time stuff. – Thomas Jungblut Jun 02 '12 at 15:39
  • I also think that there is no way to reduce Hadoop job latency. – David Gruzman Jun 02 '12 at 18:32
  • I read a lot about the inefficiencies of MongoDB's internal M/R, e.g.: "Mongo M/R only helps if you need simple group-by and filtering, not heavy shuffling between map and reduce. Hadoop's M/R is capable of utilizing all cores, while MongoDB is single threaded," etc. Also, my code needs to work with a very large dataset which will mostly be processed offline with Hadoop's M/R. However, at runtime, whenever a new user logs in, I need to match his data with millions of other users' data within seconds (the use case is similar to that of a dating site). Any thoughts on the solution would be most welcome. – Faiza Atheeq Jun 02 '12 at 20:46
  • If you need something within seconds no map/reduce framework is a good fit, I would think. Have you considered denormalizing your schema in mongo itself to allow specific query to be run interactively (i.e. via find, not map/reduce)? Obviously that's a separate question from this one but if you haven't considered it, I think you should. – Asya Kamsky Jun 03 '12 at 21:22
  • Thanks Asya. I'm trying out your suggestion. However, as I'm new to MongoDB, I just realised that group-bys and aggregate functions cannot be used on datasets above 10K in size. This is such a disappointment. To what extent will sharding help? – Faiza Atheeq Jun 05 '12 at 07:11

1 Answer


Part of the latency is caused by the heartbeat mechanism. TaskTrackers heartbeat in to the JobTracker to let it know they're alive, but as part of that heartbeat they also announce how many open map and reduce slots they have. In response, the JT assigns work for the TT to perform. This means that when you submit a job, TTs only receive tasks as fast as they heartbeat (every 2-4 seconds, give or take). Additionally, the JT (by default) only assigns a single task per heartbeat. So if you only have a single TT, you can only assign one task every 2-4 seconds, even if the TT has additional capacity.

So, you can:

  1. Shorten the interval between two heartbeats.

  2. Change how many tasks the scheduler assigns per TaskTracker heartbeat, e.g. via mapred.fairscheduler.assignmultiple (see the sketch below).
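
A mapred-site.xml along these lines addresses both points. Treat it as a sketch only: mapred.jobtracker.taskScheduler and mapred.fairscheduler.assignmultiple are standard fair-scheduler settings, but mapreduce.jobtracker.heartbeat.interval.min is only honored by Hadoop 1.x builds that include MAPREDUCE-1906, so verify the property names against your version:

<!-- Sketch only; property names vary by Hadoop version/distribution. -->
<!-- Lower the floor on the TaskTracker heartbeat interval (milliseconds).
     Honored by builds that include MAPREDUCE-1906; ignored otherwise. -->
<property>
    <name>mapreduce.jobtracker.heartbeat.interval.min</name>
    <value>300</value>
</property>
<!-- Use the fair scheduler, which can assign more than one task per heartbeat. -->
<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- Allow the fair scheduler to hand out multiple tasks on a single heartbeat. -->
<property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
</property>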

Kun Ling