There is some scary language in the docs of groupByKey
, warning that it can be "very expensive", and suggesting to use aggregateByKey
instead whenever possible.
I am wondering whether the difference in cost comes from the fact, that for some aggregattions, the entire group never never needs to be collected and loaded to the same node, or if there are other differences in implementation.
Basically, the question is whether rdd.groupByKey()
would be equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _)
or if it would still be more expensive.