
How can you pass a small amount of metadata collected in the Mapper to the Reducer? In my specific problem, I only want to pass two long values, so I would rather not use MultipleOutputFormat or MultipleOutputs just for these.

Some variants I have tried:

(1)

Mapper

    context.getCounter("Countergroup", "Counter").increment(1);

Reducer

    counter = context.getCounter("Countergroup", "Counter").getValue(); 

Counters are not propagated reliably while the job is running, so the call in the Reducer returns 0.



(2)

Mapper

    context.getConfiguration().setInt("Counter", countTotal);

Reducer

    counter = context.getConfiguration().getInt("Counter", 0);          

As expected, the Configuration cannot be changed during a running job (it was worth trying, though).

There have already been questions about this problem, but I could not find a working answer. Also, the API has changed since then. I am using Hadoop 0.20.2.



Similar questions:

Passing values from Mapper to Reducer

Accessing a mapper's counter from a reducer (this looks promising, but it does not seem to work with the 0.20.2 API)


1 Answer


If you cannot solve your problem (passing two long values from mapper to reducer, in your specific case) using counters, another approach is to take advantage of the order inversion pattern.

In this pattern, you emit an extra key-value pair from the map, with a key constructed so that it becomes the first key the reducer receives (taking advantage of the fact that the reducer receives its keys in sorted order). For example, if the keys you are emitting are numeric values from 1 to 1000, your dummy key could be "0". Since the reducer receives the keys in sorted order, it is guaranteed to process the dummy key before any other key.
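
For illustration, a minimal sketch of the reducer side of this pattern, assuming Text keys and LongWritable values (all class and variable names here are mine, not from the question):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MetadataReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private long metadata = 0;

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // The dummy key "0" sorts before the real keys "1".."1000",
            // so this branch runs before any real key is processed.
            if (key.toString().equals("0")) {
                for (LongWritable v : values) {
                    metadata += v.get();
                }
                return; // nothing to emit for the dummy key itself
            }
            // 'metadata' is now available while processing every real key.
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum + metadata)); // use metadata as needed
        }
    }

As the comments below point out, with more than one reducer the dummy key also has to reach every partition.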

You additionally have setup() and cleanup() methods in the new API (the old API has configure() and close() for the same purpose), which execute exactly once per task, before the first and after the last map()/reduce() call of that task.
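
For example, the mapper side of the pattern above could collect the metadata and emit the dummy pair once per map task in cleanup() (again a sketch with hypothetical names):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MetadataMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private long countTotal;

        @Override
        protected void setup(Context context) {
            countTotal = 0; // runs once, before the first map() call of this task
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            countTotal++;                              // collect the metadata
            context.write(line, new LongWritable(1));  // emit the real pairs
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // runs once, after the last map() call of this task:
            // emit the metadata under the dummy key that sorts first
            context.write(new Text("0"), new LongWritable(countTotal));
        }
    }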

Nishant Nagwani
  • That only works if you have a single reducer. What I understand from the OP's question is that this metadata needs to be available to all reducers, not just the one that happens to luck out and get the special key. If you can spare the data bloat, you can multiplex the metadata to all keys, thus guaranteeing that it is seen by every call to reduce(), and you can do some additional secondary-sort trickery to ensure that the metadata value is seen first when iterating the group. – Judge Mental May 25 '12 at 19:27
  • Yes, I agree that it works fine only with one reducer, unless the data is large enough for the program to get too slow with one reducer. Otherwise, you could emit multiple keys and write a custom partitioner. I agree that emitting multiple keys might not be very clean, but it's a trade-off against emitting the metadata with each key-value pair and doing a secondary sort, since the latter approach uses a lot of unnecessary memory. – Nishant Nagwani May 25 '12 at 19:46
  • Now *that* I like (custom partitioner). Emit as many keys as there are reducers, ensure via the partitioner that each reducer gets one copy, and ensure via a custom comparator that the metadata key comes before all real keys. – Judge Mental May 25 '12 at 22:59
  • Thank you very much; this little trick works fine in my current solution. For the additional first records, I needed to find values that are guaranteed never to occur in the data set (which can be ensured in code, at least here) and that therefore sort first in my sorting structures. – kapibarasama May 29 '12 at 18:59
  • The additional first elements can more easily be implemented by adding a boolean to the key (e.g. isMetadata) and an appropriate compareTo() function, as sketched below. – kapibarasama Jul 11 '12 at 23:48
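
Picking up the last comment, a minimal sketch of such a composite key (the TaggedKey name and its fields are illustrative, not from the thread):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class TaggedKey implements WritableComparable<TaggedKey> {

        private boolean isMetadata;         // true for the extra metadata records
        private Text realKey = new Text();

        public void set(boolean isMetadata, String key) {
            this.isMetadata = isMetadata;
            this.realKey.set(key);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(isMetadata);
            realKey.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            isMetadata = in.readBoolean();
            realKey.readFields(in);
        }

        @Override
        public int compareTo(TaggedKey other) {
            if (isMetadata != other.isMetadata) {
                return isMetadata ? -1 : 1; // metadata records sort before all real keys
            }
            return realKey.compareTo(other.realKey);
        }

        @Override
        public int hashCode() {
            return realKey.hashCode(); // partition on the real key, ignoring the flag
        }
    }

To make sure every reducer sees a metadata record first, pair this with a custom Partitioner and one metadata record per partition, as Judge Mental suggests above.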