
I was wondering if it is possible to define a hierarchical MapReduce job. In other words, I would like to have a MapReduce job whose mapper phase launches a different MapReduce job. Is that possible? Do you have any recommendations on how to do it?

I want to do this in order to get an additional level of parallelism/distribution in my program. Thanks, Arik.


3 Answers


The Hadoop: The Definitive Guide book contains a lot of recipes related to MapReduce job chaining, including sample code and detailed explanations. See in particular the chapter titled something like "advanced API usage".
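The chaining recipe boils down to: run one job to completion, then feed its output in as the next job's input. A minimal in-process Python sketch of that idea (no Hadoop required; `run_mapreduce` and the word-count jobs are illustrative stand-ins, not Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Toy in-process MapReduce: map, shuffle (sort/group by key), reduce."""
    mapped = [kv for record in records for kv in mapper(record)]
    mapped.sort(key=itemgetter(0))  # the "shuffle" phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Job 1: classic word count.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

# Job 2 is chained: it consumes job 1's output and groups words by count.
def by_count_map(pair):
    word, count = pair
    return [(count, word)]

def by_count_reduce(count, words):
    return (count, sorted(words))

lines = ["a b a", "b c"]
job1 = run_mapreduce(lines, wc_map, wc_reduce)            # [('a', 2), ('b', 2), ('c', 1)]
job2 = run_mapreduce(job1, by_count_map, by_count_reduce)
print(job2)                                               # [(1, ['c']), (2, ['a', 'b'])]
```

On a real cluster the same shape appears as two driver submissions, where the second job's input path is the first job's output path.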

I personally succeeded in replacing a complex map-reduce job that used several HBase tables as sources with a handmade TableInputFormat extension. The result was an input format that combines the source data with minimal reduction, so the job was transformed into a single mapper step. I recommend looking in this direction too.
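The idea of doing the combination at input time rather than in a reducer can be sketched like this (a hypothetical stand-in for a custom TableInputFormat, with made-up table data; it is not HBase API):

```python
def combined_input(*tables):
    """Merge several keyed sources into one record stream BEFORE the map
    phase, so the join that would otherwise need a reduce step is already
    done when the mapper sees each record."""
    merged = {}
    for table in tables:
        for key, value in table:
            merged.setdefault(key, []).append(value)
    return list(merged.items())

# Two "tables" sharing a row key (illustrative data).
users = [("u1", "Alice"), ("u2", "Bob")]
orders = [("u1", "book"), ("u1", "pen"), ("u2", "mug")]

# A single mapper pass over the pre-combined records -- no reducer needed.
result = [f"{key}: {values}" for key, values in combined_input(users, orders)]
print(result)
```

The design point is that the expensive shuffle is avoided entirely: each mapper already receives every value for its key.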

  • Roman, the reason I removed that link is because it is to a pirated version of the ebook. We do not allow links like that here. If you want to link to it, do so to O'Reilly's legitimate site. – Brad Larson Jul 13 '13 at 22:26
  • Their "sample" was 647 pages long and contained the original O'Reilly formatting, as well as every page from every chapter I checked. That site is a known host of bootleg material like this, so we've scrubbed all links to it here. – Brad Larson Jul 15 '13 at 14:30

You should try Cascading. It allows you to define pretty complex jobs with multiple steps.

  • Thanks, I'm looking for a "native" Hadoop solution for this task. Any suggestions? – Arik B Jun 10 '13 at 07:26
  • Native meaning no external libraries? Just wait until the job is done and submit a new one. – kichik Jun 10 '13 at 08:12
  • Or this http://stackoverflow.com/questions/2499585/chaining-multiple-mapreduce-jobs-in-hadoop – kichik Jun 10 '13 at 08:13
  • Hi. Regarding "native", that is what I meant. Regarding the other part, it is not what I meant: I want to run the second MapReduce job for every record I read in the mapper of the first MapReduce job. – Arik B Jun 10 '13 at 09:05
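What the last comment describes, launching a whole second MapReduce job for every record the first job's mapper sees, can be sketched in-process like this (illustrative only; on a real cluster each nested call would be a separate Job submission, which is expensive per record):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Toy in-process MapReduce used to illustrate nesting."""
    mapped = [kv for r in records for kv in mapper(r)]
    mapped.sort(key=itemgetter(0))
    return [reducer(k, [v for _, v in g])
            for k, g in groupby(mapped, key=itemgetter(0))]

# Inner job: count characters in one document.
def char_map(ch):
    return [(ch, 1)]

def char_reduce(ch, ones):
    return (ch, sum(ones))

# Outer mapper: for EACH record, run a complete inner MapReduce job.
def outer_map(record):
    doc_id, text = record
    inner_result = run_mapreduce(text, char_map, char_reduce)  # nested job
    return [(doc_id, dict(inner_result))]

def outer_reduce(doc_id, stats):
    return (doc_id, stats[0])

docs = [("d1", "aab"), ("d2", "bb")]
print(run_mapreduce(docs, outer_map, outer_reduce))
# [('d1', {'a': 2, 'b': 1}), ('d2', {'b': 2})]
```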

I guess you need the Oozie tool. Oozie helps in defining workflows, including chains of MapReduce jobs, using an XML file.
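For reference, an Oozie workflow definition is an XML file along these lines (a minimal sketch with a single map-reduce action; the workflow name, mapper class, and property values are illustrative):

```xml
<workflow-app name="chained-mr" xmlns="uri:oozie:workflow:0.5">
    <start to="first-job"/>
    <action name="first-job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.FirstMapper</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Chaining is expressed by pointing an action's `<ok to="..."/>` transition at the next action instead of at `end`.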