
There is a small amount of meta-data that I get by looking up the current file the mapper is working on (and a few other things). I need to send this meta-data over to the reducer. Sure, I can have the mapper emit it in the <Key, Value> pair it generates as <Key, Value + Meta-Data>, but I want to avoid that.

Also, constraining myself a little bit more, I do not want to use DistributedCache. So, do I still have some options left? More precisely, my question is twofold:

(1) I tried setting some parameters by calling job.set(Prop, Value) in my mapper's configure(JobConf) and calling job.get(Prop) in my reducer's configure(JobConf). Sadly, I found it does not work. As an aside, I am interested in knowing why it behaves this way. My main question is:

(2) How can I send the value from the mapper to the reducer in a "clean" way (if possible, within the constraints above)?

EDIT (In view of response by Praveen Sripati)

To make it more concrete, here is what I want. Based on the type of data emitted, we want it stored under different files (say data d1 ends up in D1 and data d2 ends up in D2).

The values D1 and D2 can be read from a config file, and figuring out what goes where depends on the value of map.input.file. That is, the pair <k1, d1> should, after some processing, go to D1, and <k2, d2> should go to D2. I do not want to emit things like <k1, d1+D1>. Can I somehow figure out the association without emitting D1 or D2, maybe by cleverly using the config file? The input source (i.e., input directory) for <k1, d1> and <k2, d2> is the same, which again can be seen only through map.input.file.
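To sketch what "cleverly using the config file" might look like: the data-type-to-file mapping could live in a properties-style config that is loaded once and consulted per record. The class name DataTargetLookup and the property names below are hypothetical, used only for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: reads a properties-style config that maps each
// data type (d1, d2, ...) to the output file it should land in (D1, D2, ...).
public class DataTargetLookup {
    private final Properties targets = new Properties();

    public DataTargetLookup(String configText) {
        try {
            targets.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the target file for a given data type, or null if unmapped.
    public String targetFor(String dataType) {
        return targets.getProperty(dataType);
    }

    public static void main(String[] args) {
        // Example config: d1 goes to D1, d2 goes to D2.
        DataTargetLookup lookup = new DataTargetLookup("d1=D1\nd2=D2\n");
        System.out.println(lookup.targetFor("d1")); // D1
        System.out.println(lookup.targetFor("d2")); // D2
    }
}
```

In an actual job, the config text would presumably be loaded in configure(JobConf) from a file shipped with the job, so that neither D1 nor D2 ever needs to travel in the emitted value.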

Please let me know when you get time.

Usama Abdulrehman
Akash Kumar
  • Why is it a problem for you to emit D1/D2, there? – James Jan 21 '12 at 05:30
  • @James, it is not a problem; I can do that. But there is very few of these output directories and I felt that passing it involves too many extra string manipulations which I will like to avoid if possible. If I have no other option, probably go with this option [:(]. But I would still appreciate it if there is some other option – Akash Kumar Jan 21 '12 at 06:00

1 Answer


Based on the type of data emitted we want it stored under different directories (say data d1 ends up in D1 and data d2 ends up in D2).

Usually the o/p of an MR job goes to a single output folder. Each mapper/reducer writes to a separate file. I am not sure how to write an MR job's o/p to different directories without changes to the Hadoop framework.

But, based on the output key/value types from the mapper/reducer, the output file can be chosen. Use the subclasses of MultipleOutputFormat. The MultipleOutputFormat#generateFileNameForKeyValue method has to be implemented to return a file name based on the input key.
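For illustration, here is a sketch of the kind of key-based routing logic that method would contain. It is written as a plain static method so the example stands alone; in a real job this body would sit inside a subclass of MultipleOutputFormat overriding generateFileNameForKeyValue. The key prefixes and file names are made up:

```java
// Sketch of the routing logic one might put inside
// MultipleOutputFormat#generateFileNameForKeyValue(K key, V value, String name).
// The k1/k2 prefixes and the D1/D2 file names are illustrative only.
public class OutputFileChooser {
    public static String generateFileNameForKeyValue(String key, String value, String defaultName) {
        // Route records to different output files based on the key.
        if (key.startsWith("k1")) {
            return "D1";
        } else if (key.startsWith("k2")) {
            return "D2";
        }
        return defaultName; // fall back to the leaf file name Hadoop proposes
    }

    public static void main(String[] args) {
        System.out.println(generateFileNameForKeyValue("k1", "d1", "part-00000")); // D1
        System.out.println(generateFileNameForKeyValue("k2", "d2", "part-00000")); // D2
    }
}
```

With this in place, the mapper never needs to emit D1 or D2 in the value; the output format derives the destination from the key alone.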

See how PartitionByStationUsingMultipleOutputFormat is implemented in the sample code for the book Hadoop: The Definitive Guide.

Once the job has completed, the o/p can easily be moved to a different directory using hadoop fs commands.

Praveen Sripati
  • thanks for your reply. Just wanted to add a little note D1 and D2 need not be directories. Say they are different files. I know about MultiFileOutputFormat. BTW, I am sorry for wrongly stating what I wanted - I will make the edit to correctly reflect what I want. I appreciate your response - however I would still like to know if I can get my hands on where to put d1 and where to put d2 without having the mapper emit D1 and D2 in the value string. Thanks again – Akash Kumar Jan 21 '12 at 08:35
  • Your query is a bit (actually a lot) confusing :) You should make it simple and precise to get a proper response. So do you want to put the o/p of reducer to a file based on key or value type? either way you can do the mapping (K or V to file name) in a file and put it in HDFS. Read this file at the start up of the reducer code and put it into a static variable and use that variable in the generateFileNameForKeyValue method to return the appropriate o/p file name. I leave it here for you to figure it out. – Praveen Sripati Jan 21 '12 at 11:49
  • 1
    thanks again @praveen for your response. I guess I might as well go with < K, V + D > solution. Just wanted to ask if you had any ideas about (1) though. Why can't I do a job.set in mapper's configure and access it in reducer? – Akash Kumar Jan 21 '12 at 17:09