reduce() vs. fold() in Apache Spark

Question

What is the difference between reduce vs. fold with respect to their technical implementation?

I understand that they differ by their signature as fold accepts additional parameter (i.e. initial value) which gets added to each partition output.

Can someone tell about use case for these two actions?
Which would perform better in which scenario consider 0 is used for fold?

Thanks in advance.

I think that this not spark dependent, sonwill be better to read other resources or questions http://stackoverflow.com/questions/25158780/difference-between-reduce-and-foldleft-fold-in-functional-programming-particula — anquegi, Mar 17 '16 at 10:24
@anquegi, the difference that reduce can be parallelized (which is explained in the answer you link to) does have a major impact in Spark, though. It has nothing to do with Hadoop,.. and is really a duplicate of that question — The Archetypal Paul, Mar 17 '16 at 11:02
@TheArchetypalPaul fold (not foldLeft / foldRight) can be (and is) parallelized as well. — zero323, Mar 17 '16 at 11:05
@zero323 Ah, good point, I'd assumed foldLeft was meant. Voting to reopen — The Archetypal Paul, Mar 17 '16 at 11:50
Looking at the source, they'll run in essentially the same time. `fold` calls `fold` on an iterator for each partition, then merges the results, `reduce` calls `reduceLeft` on the iterator for each partition then merges the result. The difference is that `fold` doesn't need to worry about empty partitions or collections, because then it will just use the zero value. The performance difference probably isn't even measurable. — The Archetypal Paul, Mar 17 '16 at 11:57
@TheArchetypalPaul Oh, I didn't see your second comment. I'll turn the answer into wiki then :) You can leverage difference in semantics for performance though (http://stackoverflow.com/q/34529953/1560062) — zero323, Mar 17 '16 at 12:11
@zero323, no problem, it's a good answer and nether of us need the rep anyway :) — The Archetypal Paul, Mar 17 '16 at 12:32
@TheArchetypalPaul Don't say that. I've heard they give free stickers when you get to 100k. Nevertheless I paid the price. Looking into Scala source is always traumatic. _abyss stares back at you_ you know. — zero323, Mar 17 '16 at 12:40

score 10 · Accepted Answer · edited May 23 '17 at 11:55

There is no practical difference when it comes to performance whatsoever:

RDD.fold action is using fold on the partition Iterators which is implemented using foldLeft.
RDD.reduce is using reduceLefton the partition Iterators.

Both methods keep mutable accumulator and process partitions sequentially using simple loops with foldLeft implemented like this:

foreach (x => result = op(result, x))

and reduceLeft like this:

for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}

Practical difference between these methods in Spark is only related to their behavior on empty collections and ability to use mutable buffer (arguably it is related to performance). You'll find some discussion in Why is the fold action necessary in Spark?

Moreover there is no difference in the overall processing model:

Each partition is processed sequentially using a single thread.
Partitions are processed in parallel using multiple executors / executor threads.
Final merge is performed sequentially using a single thread on the driver.

reduce() vs. fold() in Apache Spark

1 Answers1

Linked