4

I've been reading a nice answer to Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)? provided by samthebest and I am not sure if I understand all the details:

  • According to the answer (reduce vs foldLeft):

    A big big difference (...) is that reduce should be given a commutative monoid, (...)

    This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists.

    and

    Reduce is defined formally as part of the MapReduce paradigm,

    I am not sure how this two statements combine. Can anyone put some light on that?

  • I tested different collections and I haven't seen performance difference between reduce and foldLeft. It looks like ParSeq is a special case, is that right?

  • Do we really need order to define fold?

    we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.

    Why it couldn't be generalized to unordered collection?

Community
  • 1
  • 1
user7337271
  • 1,662
  • 1
  • 14
  • 23
  • What is there to understand? For `foldLeft` you cannot assume associativity/commutativity (no opportunity for parallelization), for `reduce` you can (trivial to parallelize). Not sure how it could be made any clearer than that. These are generic concepts, and extend beyond the perf of any collections that happen to be in the Scala standard library at any particular point in time. – Jared Smith Dec 29 '16 at 21:09
  • 1
    @JaredSmith I think that reduce in MapReduce has different meaning than reduce in Spark or Scala collections. Am I wrong? – user7337271 Dec 29 '16 at 21:17
  • Yes, AFAIK currently the only difference in scala is the seed value that may be supplied to `foldLeft`. But the point of the answer that you reference is that there *should* be a difference involving the mathematical properties of the binary operator applied to the type in question. The answers to this question should help you http://stackoverflow.com/questions/17408880/reduce-fold-or-scan-left-right – Jared Smith Dec 29 '16 at 21:34
  • @JaredSmith Thank you but I don't have any doubts about (fold|reduce)(left|right). I just don't understand a) association between reduce in Map__Reduce__ and reduce and reduce in Spark b) Why limit fold to ordered collections when conceptual it is more general – user7337271 Dec 31 '16 at 10:59

1 Answers1

13

As mentioned in the comments, the term reduce means different thing when used in the context of MapReduce and when used in the context of functional programming.

  • In MapReduce, the system groups the results of the map function by a given key and then calls the reduce operation to aggregate values for each group (so reduce is called once for each group). You can see it as a function (K, [V]) -> R taking the group key K together with all the values belonging to the group [V] and producing some result.

  • In functional programming, reduce is a function that aggregates elements of some collection when you give it an operation that can combine two elements. In other words, you define a function (V, V) -> V and the reduce function uses it to aggregate a collection [V] into a single value V.

When you want to add numbers [1,2,3,4] using + as the function, the reduce function can do it in a number of ways:

  1. It can run from the start and calculate ((1+2)+3)+4)
  2. It can also calculate a = 1+2 and b = 3+4 in parallel and then add a+b!

The foldLeft operation is, by definition always proceeding from the left and so it always uses the evaluation strategy of (1). In fact, it also takes an initial value, so it evaluates something more like (((0+1)+2)+3)+4). This makes foldLeft useful for operations where the order matters, but it also means that it cannot be implemented for unordered collections (because you do not know what "left" is).

Tomas Petricek
  • 240,744
  • 19
  • 378
  • 553