
Some operations, like median and mean, are non-commutative. It seems there can be only one reducer in this case, since the reducer needs a global view. Is there any non-commutative reducer in map-reduce that can be executed in parallel? When people encounter non-commutative operations, do they really use map-reduce, or do they just run the computation on one very powerful machine? Are there common ways to break non-commutative operations into commutative ones?

Thanks

Shen Li
  • There is some discussion of this here http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html – Mark Butler Feb 07 '14 at 14:15
  • Related question here http://stackoverflow.com/questions/11731770/combiner-and-reducer-can-be-different – Mark Butler Feb 07 '14 at 14:19

1 Answer


I don't know if "commutative" is the right word to use here, but I understand what you are saying.

In Hadoop, the post-mapping phase is actually divided into two steps: a Combiner and a Reducer, which share the same signature. The Combiner runs on the mappers to shrink their output before it gets key-sorted and sent to the reducers. If you only specify a Reducer, it is used for both steps; but separating them lets you do more than you might expect.

The simple case of counting uses a summing reducer, which can serve as both the combine step and the reduce step. This avoids sending the same key over the wire many times.
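As a sketch of that counting case (in Python rather than Hadoop's Java API; the function names here are illustrative, not part of any framework):

```python
from collections import defaultdict

def mapper(document):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield word, 1

def count_reducer(key, counts):
    # Summation is associative and commutative, so the same
    # function works as both the combiner and the reducer.
    return key, sum(counts)

def run_combiner(mapped_pairs):
    # Pre-aggregate on the mapper side so each distinct key crosses
    # the network once instead of once per occurrence.
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return [count_reducer(k, v) for k, v in grouped.items()]

pairs = list(mapper("a b a c a b"))   # six (word, 1) pairs
combined = dict(run_combiner(pairs))  # {'a': 3, 'b': 2, 'c': 1}
```

Without the combiner, six pairs would be shuffled; with it, only three partial counts are.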

You can achieve similar efficiency for computing the mean by defining different combiners and reducers. For example, mappers output a pair (number, 1), i.e. a numerical value with a count of 1. The combiner collapses a collection of such pairs into a single (sum, count) tuple, and the reducer aggregates these tuples using the counts as weights to produce the overall mean. (As an aside: you can greatly reduce error when adding many floating-point numbers by using Kahan summation.) This lets the mappers do part of the work locally, just as in the simple counting example.
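The mean scheme above can be sketched like this (again plain Python, not a real Hadoop job; the Kahan helper is an optional refinement for the summing steps):

```python
def mean_mapper(x):
    # Each value becomes a (sum, count) pair with count 1.
    return (x, 1)

def mean_combiner(pairs):
    # Mapper-side: collapse many (sum, count) pairs into one.
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    return (total, count)

def mean_reducer(partials):
    # Reducer: aggregate per-mapper partials, weighting by count.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

def kahan_sum(values):
    # Compensated summation: `comp` tracks the low-order bits
    # that plain floating-point addition would lose.
    total = comp = 0.0
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

partial_a = mean_combiner([mean_mapper(x) for x in [1, 2, 3]])  # (6, 3)
partial_b = mean_combiner([mean_mapper(x) for x in [10, 20]])   # (30, 2)
mean = mean_reducer([partial_a, partial_b])                     # 36 / 5 = 7.2
```

Note that the combiner's output type, (sum, count), differs from the reducer's output, which is exactly why the two steps cannot always share one function.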

You can do a lot of clever things in a single map-reduce step. However, I don't think this is possible for the median; in that case you will actually have to send all the numbers through the state of one machine.
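Funneling everything through one machine for an exact median might look like this sketch (a hypothetical map-only arrangement; everything is emitted under one key so a single reducer sees all values):

```python
def identity_mapper(x):
    # Emit every value under a single shared key, forcing the whole
    # dataset onto one reducer.
    return ("all", x)

def median_reducer(values):
    # Needs the complete sorted list; no combiner can safely
    # pre-aggregate without changing the answer.
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

data = [7, 1, 5, 3, 9]
values = [v for _, v in (identity_mapper(x) for x in data)]
print(median_reducer(values))  # 5
```

This works, but it gives up all reduce-side parallelism, which is why the approximate-median approaches linked in the comments below exist.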

Andrew Mao
  • Hi @Andrew_Mao, thanks for your reply. So, I guess you mean we can assume the reducers are "commutative" (please let me know if there is a better word in this scenario)? Otherwise (say median), we can launch a map-only job and execute the reducing part on one single machine. – Shen Li Feb 14 '13 at 06:55
  • Hi, I don't understand your question "we can assume the reducers are commutative". A commutative operator means it is insensitive to the order of its operands, while here we are talking about the difference of being able to use the same reducer for both combining and reducing, or needing different ones to get a result. By the way, others have discussed approximations to the median: http://stackoverflow.com/questions/6968215/find-median-of-a-large-integer-set-using-mapreduce, http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce – Andrew Mao Feb 14 '13 at 07:37