Let's work through what is actually needed logically.
First, note that if your collection is unordered, any set of (binary) operations on it need to be both commutative and associative, or you'll get different answers depending on which (arbitrary) order you choose each time. Since reduce
, fold
, and aggregate
all use binary operations, if you use these things on a collection that is unordered (or is viewed as unordered), everything must be commutative and associative.
reduce
is an implementation of the idea that if you can take two things and turn them into one thing, you can collapse an arbitrarily long collection into a single element. Associativity is exactly the property that it doesn't matter how you pair things up as long as you eventually pair them all and keep the left-to-right order unchanged, so that's exactly what you need.
a b c d a b c d a b c d
a # b c d a # b c d a b # c d
(a#b) c # d (a#b) # c d a (b#c) d
(a#b) # (c#d) ((a#b)#c) # d a # ((b#c)#d)
All of the above are the same as long as the operation (here called #
) is associative. There is no reason to swap around which things go on the left and which go on the right, so the operation does not need to be commutative (addition is: a+b == b+a; concat is not: ab != ba).
reduce
is mathematically simple and requires only an associative operation
Reduce is limited, though, in that it doesn't work on empty collections, and in that you can't change the type. If you're working sequentially, you can a function that takes a new type and the old type, and produces something with the new type. This is a sequential fold (left-fold if the new type goes on the left, right-fold if it goes on the right). There is no choice about the order of operations here, so commutativity and associativity and everything are irrelevant. There's exactly one way to work through your list sequentially. (If you want your left-fold and right-fold to always be the same, then the operation must be associative and commutative, but since left- and right-folds don't generally get accidentally swapped, this isn't very important to ensure.)
The problem comes when you want to work in parallel. You can't sequentially go through your collection; that's not parallel by definition! So you have to insert the new type at multiple places! Let's call our fold operation @
, and we'll say that the new type goes on the left. Furthermore, we'll say that we always start with the same element, Z
. Now we could do any of the following (and more):
a b c d a b c d a b c d
Z@a b c d Z@a b Z@c d Z@a Z@b Z@c Z@d
(Z@a) @ b c d (Z@a) @ b (Z@c) @ d
((Z@a)@b) @ c d
(((Z@a)@b)@c) @ d
Now we have a collection of one or more things of the new type. (If the original collection was empty, we just take Z
.) We know what to do with that! Reduce! So we make a reduce operation for our new type (let's call it $
, and remember it has to be associative), and then we have aggregate:
a b c d a b c d a b c d
Z@a b c d Z@a b Z@c d Z@a Z@b Z@c Z@d
(Z@a) @ b c d (Z@a) @ b (Z@c) @ d Z@a $ Z@b Z@c $ Z@d
((Z@a)@b) @ c d ((Z@a)@b) $ ((Z@c)@d) ((Z@a)$(Z@b)) $ ((Z@c)$(Z@d))
(((Z@a)@b)@c) @ d
Now, these things all look really different. How can we make sure that they end up to be the same? There is no single concept that describes this, but the Z@
operation has to be zero-like and $
and @
have to be homomorphic, in that we need (Z@a)@b == (Z@a)$(Z@b)
. That's the actual relationship that you need (and it is technically very similar to a semigroup homomorphism). There are all sorts of ways to pick badly even if everything is associative and commutative. For example, if Z
is the double value 0.0
and @
is actually +
, then Z
is zero-like and @
is associative and commutative. But if $
is actually *
, which is also associative and commutative, everything goes wrong:
(0.0+2) * (0.0+3) == 2.0 * 3.0 == 6.0
((0.0+2) + 3) == 2.0 + 3 == 5.0
One example of a non-trival aggregate is building a collection, where @
is the "append an element" operator and $
is the "concat two collections" operation.
aggregate
is tricky and requires an associative reduce operation, plus a zero-like value and a fold-like operation that is homomorphic to the reduce
The bottom line is that aggregate
is not simply a generalization of reduce
.
But there is a simplification (less general form) if you're not actually changing the type. If Z
is actually z
and is an actual zero, we can just stick it in wherever we want and use reduce. Again, we don't need commutativity conceptually; we just stick in one or more z
's and reduce, and our @
and $
operations can be the same thing, namely the original #
we used on the reduce
a b c d () <- empty
z#a z#b z
z#a (z#b)#c
z#a ((z#b)#c)#d
(z#a)#((z#b)#c)#d
If we just delete the z
's from here, it works perfectly well, and in fact is equivalent to if (empty) z else reduce
. But there's another way it could work too. If the operation #
is also commutative, and z
is not actually a zero but just occupies a fixed point of #
(meaning z#z == z
but z#a
is not necessarily just a
), then you can run the same thing, and since commutivity lets you switch the order around, you conceptually can reorder all the z's together at the beginning, and then merge them all together.
And this is a parallel fold, which is really a rather different beast than a sequential fold.
(Note that neither fold
nor aggregate
are strictly generalizations of reduce
even for unordered collections where operations have to be associative and commutative, as some operations do not have a sensible zero! For instance, reducing strings by shortest length has as its "zero" the longest possible string, which conceptually doesn't exist, and practically is an absurd waste of memory.)
fold
requires an associative reduce operation plus either a zero value or a reduce operation that's commutative plus a fixed-point value
Now, when would you ever use a parallel fold that wasn't just a reduceOrElse(zero)? Probably never, actually, though they can exist. For example, if you have a ring, you often have fixed points of the type we need. For instance, 10 % 45 == (10*10) % 45, and *
is both associative and commutative in integers mod 45. Thus, if our collection is numbers mod 45, we can fold with a "zero" of 10
and an operation of *
, and parallelize however we please while still getting the same result. Pretty weird.
However, note that you can just plug the zero and operation of fold
into aggregate
and get exactly the same result, so aggregate
is a proper generalization of fold
.
So, bottom line:
- Reduce requires only an associative merge operation, but doesn't change the type, and doesn't work on empty collecitons.
- Parallel fold tries to extend
reduce
but requires a true zero, or a fixed point and the merge operation must be commutative.
- Aggregate changes the type by (conceptually) running sequential folds followed by a (parallel) reduce, but there are complex relationships between the reduce operation and the fold operation--basically they have to be doing "the same thing".
- An unordered collection (e.g. a set) always requires an associative and commutative operation for any of the above.
With regard to the byKey
stuff: it's just the same as this, except it only applies it to the collection of values associated with a (potentially repeated) key.
If Spark actually requires commutativity where the above analysis does not suggest it's needed, one could reasonably consider that a bug (or at least an unnecessary limitation of the implementation, given that operations like map
and filter
preserve order on ordered RDDs).