Difference between GroupByKey($"col") and GroupBy($"col") in spark scala

Question

What is the fundamental difference between .groupByKey and .groupBy when I pass a column of a DataFrame as the parameter?

Which one is more time-efficient, and what exactly does each one do? Could someone please explain in detail? I went through some examples, but they were confusing.

score 3 · Accepted Answer · answered Oct 23 '18 at 12:26

There is no groupByKey method that takes a Column as an argument. There are only overloads that take a function, either:

def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]

or

def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

Compared to groupBy that takes Columns:

def groupBy(cols: Column*): RelationalGroupedDataset

or String

def groupBy(col1: String, cols: String*): RelationalGroupedDataset

the difference should be obvious: the first two return KeyValueGroupedDataset (intended for processing with the "functional", strongly typed API, such as mapGroups or reduceGroups), while the latter two return RelationalGroupedDataset (intended for processing with the SQL-like API).
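To make the contrast concrete, here is a minimal sketch that computes the same per-key total both ways. The `Sale` case class, the sample rows, and the `GroupingSketch` object are hypothetical and exist only for illustration; it assumes Spark SQL is on the classpath and that a local session is acceptable.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for this illustration.
// It is declared at top level so that Spark can derive an Encoder for it.
case class Sale(region: String, amount: Long)

object GroupingSketch {

  // Returns (totals via groupBy, totals via groupByKey) as plain Scala maps.
  def totals(spark: SparkSession): (Map[String, Long], Map[String, Long]) = {
    import spark.implicits._

    val sales: Dataset[Sale] =
      Seq(Sale("east", 10L), Sale("east", 5L), Sale("west", 7L)).toDS()

    // Untyped, SQL-like API: groupBy takes Columns and returns a
    // RelationalGroupedDataset, which is aggregated with Column expressions.
    val relational = sales
      .groupBy($"region")
      .agg(sum($"amount").as("total"))
      .as[(String, Long)]

    // Typed API: groupByKey takes a function Sale => K and returns a
    // KeyValueGroupedDataset[String, Sale], processed with ordinary Scala code.
    val typed = sales
      .groupByKey(_.region)
      .mapGroups { (region, rows) => (region, rows.map(_.amount).sum) }

    (relational.collect().toMap, typed.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: running locally, not on a cluster
      .appName("grouping-sketch")
      .getOrCreate()
    try {
      val (viaGroupBy, viaGroupByKey) = totals(spark)
      println(s"groupBy:    $viaGroupBy")
      println(s"groupByKey: $viaGroupByKey")
    } finally {
      spark.stop()
    }
  }
}
```

Both paths produce the same totals here. What differs is the shape of the API: the groupBy version is written with Column expressions, while the groupByKey version works on `Sale` objects through a Scala function.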

