groupByKey vs. aggregateByKey - where exactly does the difference come from?

Question

There is some scary language in the docs of groupByKey, warning that it can be "very expensive", and suggesting to use aggregateByKey instead whenever possible.

I am wondering whether the difference in cost comes from the fact that, for some aggregations, the entire group never needs to be collected and loaded onto the same node, or whether there are other differences in implementation.

Basically, the question is whether rdd.groupByKey() would be equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _) or if it would still be more expensive.

`I am wondering whether the difference in cost comes from the fact, that for some aggregattions, the entire group never never needs to be collected and loaded to the same node, or if there are other differences in implementation.` Exactly — T. Gawęda, Sep 20 '17 at 11:25
The people, voting to close - care to explain? "Not programming"? Huh? — Dima, Sep 20 '17 at 11:40
Claim: In majority of cases `rdd.groupByKey()` will be significantly cheaper than `rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _)`. I made this point [here](https://stackoverflow.com/a/39316189/1560062) and with @eliasah [here](https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey) (external link). — zero323, Sep 20 '17 at 12:54

Knight71 · Accepted Answer · 2017-09-20T12:50:29.253

If you are reducing to a single element instead of a list, for example in a word count, then aggregateByKey performs better: it combines the values for each key within every partition before the shuffle, so far less data is shuffled, as explained in the link "performance of group by vs aggregate by".
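
To make the word count case concrete, here is a minimal sketch in plain Scala that simulates partitions with ordinary collections. It is not Spark itself, and the object name, the sample data and the helper names are all made up for illustration. It only counts how many records would have to cross the shuffle with and without a per-partition combine.

```scala
object WordCountShuffle {
  type Partition = Seq[(String, Int)]

  // Hypothetical sample data: two "partitions" of (word, 1) pairs.
  val partitions: Seq[Partition] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1),
    Seq("a" -> 1, "a" -> 1, "b" -> 1)
  )

  // groupByKey style: every record crosses the shuffle unchanged.
  val shuffledWithoutCombine: Seq[(String, Int)] = partitions.flatten

  // aggregateByKey / reduceByKey style: sum per key inside each partition first.
  def combinePartition(p: Partition): Map[String, Int] =
    p.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  // After the per-partition combine, at most one record per key and partition remains.
  val shuffledWithCombine: Seq[(String, Int)] =
    partitions.flatMap(p => combinePartition(p).toSeq)

  // Reduce side: merge whatever arrived into the final counts.
  def finalCounts(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled without combine: ${shuffledWithoutCombine.size}")
    println(s"records shuffled with combine:    ${shuffledWithCombine.size}")
    println(s"final counts: ${finalCounts(shuffledWithCombine)}")
  }
}
```

With this sample data, six records are shuffled without the combine and four with it, and both paths give the same final counts. The gap grows with the number of repeated keys per partition.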

But in your case you are merging into a list. With aggregateByKey, all the values for a key in a partition are first reduced to a single list, and that data is then sent to the shuffle. This creates as many lists per key as there are partitions holding that key, so memory use will be high.

With groupByKey, the merge happens only on the one node responsible for the key, so only one list is created per key. When merging into a list, groupByKey is therefore optimal in terms of memory.
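
The list case can be sketched the same way. The following plain Scala simulation (again not Spark; the object name, the data and the value names are hypothetical) uses `seqOp` and `combOp` functions in the spirit of `aggregateByKey(List.empty[Int])(_ :+ _, _ ++ _)`. It counts the intermediate lists and the elements that would be shuffled.

```scala
object ListMergeCost {
  // Hypothetical sample data: three "partitions" of (key, value) pairs.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5),
    Seq("a" -> 6)
  )

  def seqOp(acc: List[Int], v: Int): List[Int] = acc :+ v
  def combOp(l: List[Int], r: List[Int]): List[Int] = l ++ r

  // Map side of the aggregate: one intermediate list per (partition, key).
  def combinePartition(p: Seq[(String, Int)]): Map[String, List[Int]] =
    p.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, seqOp(acc.getOrElse(k, Nil), v))
    }

  val perPartitionLists: Seq[(String, List[Int])] =
    partitions.flatMap(p => combinePartition(p).toSeq)

  // Nothing shrinks: the lists carry exactly as many elements as the raw records.
  val elementsShuffledAggregate: Int = perPartitionLists.map(_._2.size).sum
  val elementsShuffledGroup: Int = partitions.map(_.size).sum

  // Lists allocated before the final merge, versus one list per key when grouping.
  val intermediateLists: Int = perPartitionLists.size
  val distinctKeys: Int = partitions.flatten.map(_._1).distinct.size

  // Reduce side of the aggregate: concatenate the per-partition lists.
  val merged: Map[String, List[Int]] =
    perPartitionLists.groupBy(_._1).map { case (k, ls) =>
      k -> ls.map(_._2).foldLeft(List.empty[Int])(combOp)
    }

  // groupByKey style: build a single list per key on the receiving side.
  val grouped: Map[String, List[Int]] =
    partitions.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).toList }
}
```

In this simulation both approaches move six elements and produce the same result, but the aggregate builds five intermediate lists where grouping builds two. Element order inside a group matches here only because the simulation is sequential. Spark does not guarantee it.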

Also refer to the SO answer by zero323.

I am not sure about your use case, but if you can limit the number of elements in the list in the end result, then aggregateByKey / combineByKey will certainly give much better results than groupByKey. For example, if you want to take only the top 10 values for a given key, you can achieve this more efficiently by using combineByKey with proper merge and combiner functions than by using groupByKey and then taking 10.
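
One possible shape for those merge and combiner functions is sketched below. It assumes `Int` values and uses `n = 3` instead of 10 to keep the example small. The object name, the function bodies and the local simulation are illustrative assumptions, not taken from any particular implementation. The three functions match the argument types that Spark's `combineByKey` expects. The block drives them with plain Scala collections, so it runs without Spark.

```scala
object TopNCombiner {
  // Assumption for the example; the text above talks about the top 10.
  val n = 3

  private val descending: Ordering[Int] = Ordering[Int].reverse

  // First value seen for a key in a partition.
  def createCombiner(v: Int): List[Int] = List(v)

  // Add one more value to a partition-local top-n list, keeping at most n elements.
  def mergeValue(acc: List[Int], v: Int): List[Int] =
    (v :: acc).sorted(descending).take(n)

  // Merge two top-n lists coming from different partitions.
  def mergeCombiners(l: List[Int], r: List[Int]): List[Int] =
    (l ++ r).sorted(descending).take(n)

  // With a Spark RDD[(String, Int)] named `rdd` (hypothetical), the call would look like:
  //   rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)

  // Local stand-in for the map side: never holds more than n values per key.
  def combinePartition(p: Seq[(String, Int)]): Map[String, List[Int]] =
    p.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
      val updated = acc.get(k) match {
        case Some(top) => mergeValue(top, v)
        case None      => createCombiner(v)
      }
      acc.updated(k, updated)
    }

  // Local stand-in for the reduce side: merge the per-partition top-n lists.
  def topN(partitions: Seq[Seq[(String, Int)]]): Map[String, List[Int]] =
    partitions
      .map(combinePartition)
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, tops) =>
        k -> tops.map(_._2).foldLeft(List.empty[Int])(mergeCombiners)
      }
}
```

Because every intermediate list is capped at `n` elements, at most `n` values per key and partition would cross the shuffle, however many values the key has.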

Is it safe to assume that the benefits of (combine/aggregate/reduce)ByKey are properly utilized only when data resides on multiple partitions and the function used can be applied as a Combiner on each partition? — philantrovert, Sep 20 '17 at 12:56
The assumption is correct. Also, I have provided a use case where these operations are effective. In short, when your data can be shrunk, use (aggregate/reduce/combine)ByKey. — Knight71, Sep 20 '17 at 12:59

score -1 · Answer 2 · answered Sep 20 '17 at 11:53

Let me help illustrate why the groupByKey operation leads to much higher cost.

Given the semantics of this operation, the reduce task has to group all the values associated with a single unique key.

In short, let us have a look at its signature:

def groupByKey(): RDD[(K, Iterable[V])]

Because of the "group by" semantics, the values associated with a key that sit in partitions on different nodes cannot be pre-merged. A huge amount of data is transferred over the network, leading to a high network I/O load.

But aggregateByKey is different. Let me clarify its signature:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

The Spark engine implements these semantics as follows:

Within each partition it performs a pre-merge, which means that a specific reducer only needs to fetch the pre-merged intermediate results of the shuffle map tasks.

This makes the network I/O significantly lighter.

So, you seem to be saying that `rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _)` is indeed equivalent to `rdd.groupByKey`. Right? — Dima, Sep 20 '17 at 12:02
Then I don't understand what you are saying. The result of my aggregate has all elements for a key on the same node. Isn't that what you said was causing the cost of `groupBy`? — Dima, Sep 20 '17 at 12:09
