What is the fundamental difference between .groupByKey and .groupBy when I pass a column name of a DataFrame as the parameter?

Which one is more time-efficient, and what exactly does each one do? Could someone please explain in detail? I went through some examples, but they were confusing.

Sundeep Pidugu

1 Answer

There is no groupByKey method that takes Column as an argument. There are methods which take functions, either:

def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T] 

or

def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T] 

Compare these with groupBy, which takes Columns:

def groupBy(cols: Column*): RelationalGroupedDataset 

or Strings:

def groupBy(col1: String, cols: String*): RelationalGroupedDataset 

The difference should be clear: the first two return a KeyValueGroupedDataset (intended for processing with the "functional", strongly typed API, such as mapGroups or reduceGroups), while the latter two return a RelationalGroupedDataset (intended for processing with the SQL-like API).
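To make the contrast between the two return types concrete, here is a minimal sketch that computes the same per-group total both ways. The Sale case class, its dept and amount fields, the sample rows, and the local SparkSession settings are all assumptions made up for illustration; they are not part of the question.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for this illustration.
// It is defined at the top level so Spark can derive an Encoder for it.
case class Sale(dept: String, amount: Long)

object GroupingSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local session; adjust master/appName for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("a", 10L), Sale("a", 5L), Sale("b", 7L)).toDS()

    // groupBy takes column names (or Columns) and returns a
    // RelationalGroupedDataset, which is aggregated with SQL-like expressions.
    val relational: RelationalGroupedDataset = sales.groupBy("dept")
    val totalsViaGroupBy = relational.agg(sum("amount").as("total"))

    // groupByKey takes a function from the record type to a key and returns a
    // KeyValueGroupedDataset, which is processed with typed functions.
    val keyed: KeyValueGroupedDataset[String, Sale] =
      sales.groupByKey((s: Sale) => s.dept)
    val totalsViaGroupByKey = keyed.mapGroups {
      (dept: String, rows: Iterator[Sale]) => (dept, rows.map(_.amount).sum)
    }

    totalsViaGroupBy.show()
    totalsViaGroupByKey.show()

    spark.stop()
  }
}
```

Note that the groupBy result is an untyped DataFrame, whereas the groupByKey result here is a Dataset[(String, Long)], so the choice mostly determines which style of API the rest of the pipeline uses.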

In general see: