Spark DataFrame/Dataset Find most common value for each key Efficient way

Question

Problem: I have a problem to map most common value of a key in spark(using scala). I have done it with RDD but don't know how to do efficiently with DF/DS(sparksql)

dataset is like

key1 = value_a
key1 = value_b
key1 = value_b
key2 = value_a
key2 = value_c
key2 = value_c
key3 = value_a

After spark transformation and access output should be each key with its common value

Output

key1 = valueb
key2 = valuec
key3 = valuea

Tried until now:

RDD

I have tried to map and reduce by group of (key,value),count in RDD and it makes logic but I cant translate this into sparksql(DataFrame/Dataset) (as I want minimum shuffle across network)

Here is my code for RDD

 val data = List(

"key1,value_a",
"key1,value_b",
"key1,value_b",
"key2,value_a",
"key2,value_c",
"key2,value_c",
"key3,value_a"

)

val sparkConf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(sparkConf)

val lineRDD = sc.parallelize(data)

val pairedRDD = lineRDD.map { line =>
val fields = line.split(",")
(fields(0), fields(2))
}

val flatPairsRDD = pairedRDD.flatMap {
  (key, val) => ((key, val), 1)
}

val SumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)




val resultsRDD = SumRDD.map{
  case ((key, val), count) => (key, (val,count))
 }.groupByKey.map{
  case (key, valList) => (name, valList.toList.sortBy(_._2).reverse.head)
}

resultsRDD.collect().foreach(println)

DataFrame , Using Windowing: I am trying with Window.partitionBy("key", "value") to aggregate the count over the window. and thn sorting and agg() respectively

you'll need to use a window function after a group by key, value with count, sort on count and get the first ranked row. You can check this https://stackoverflow.com/questions/33878370/how-to-select-the-first-row-of-each-group/ — eliasah, Nov 14 '17 at 16:28
@eliasah no problem, please write when it is possible for you — A.B, Nov 14 '17 at 16:39
@eliasah please post your answer if it further optimizes the operation — A.B, Nov 16 '17 at 10:27

Ramesh Maharjan · Accepted Answer · 2017-11-14T17:47:29.107

According to what I understood from your question here's what you can do

First you have to read the data and convert it to dataframe

val df = sc.textFile("path to the data file")   //reading file line by line
  .map(line => line.split("="))                 // splitting each line by =
  .map(array => (array(0).trim, array(1).trim)) //tuple2(key, value) created
  .toDF("key", "value")                        //rdd converted to dataframe which required import sqlContext.implicits._

which would be

+----+-------+
|key |value  |
+----+-------+
|key1|value_a|
|key1|value_b|
|key1|value_b|
|key2|value_a|
|key2|value_c|
|key2|value_c|
|key3|value_a|
+----+-------+

Next step would be to count the repetition of identical values for each key and select the value that repeated the most for each key which can be done by using Window function, and aggregations as below

import org.apache.spark.sql.expressions._                   //import Window library
def windowSpec = Window.partitionBy("key", "value")         //defining a window frame for the aggregation
import org.apache.spark.sql.functions._                     //importing inbuilt functions
df.withColumn("count", count("value").over(windowSpec))     // counting repeatition of value for each group of key, value and assigning that value to new column called as count
  .orderBy($"count".desc)                                   // order dataframe with count in descending order
  .groupBy("key")                                           // group by key
  .agg(first("value").as("value"))                          //taking the first row of each key with count column as the highest

thus the final output should be equal to

+----+-------+
|key |value  |
+----+-------+
|key3|value_a|
|key1|value_b|
|key2|value_c|
+----+-------+

@RameshMaharjan Thanks for the answer. Was curious the performance difference using `groupBy` in [this answer](https://stackoverflow.com/a/65767105/8932910) compared to `Window` function, what do you think? — jack, Jan 17 '21 at 23:38
Very nice answer! Good to keep in mind that this approach will struggle with skew if a particular key-value combination occurs very often (e.g. 20+ million such records, for executors with 4GB RAM). — abeboparebop, Aug 09 '23 at 09:00

score 0 · Answer 2 · answered Jan 17 '21 at 23:36

What about using groupBy?

val maxFreq= udf((values: List[Int]) => {
  values.groupBy(identity).mapValues(_.size).maxBy(_._2)._1
})

df.groupBy("key")
  .agg(collect_list("value") as "valueList")
  .withColumn("mostFrequentValue", maxFreq(col("valueList")))

Spark DataFrame/Dataset Find most common value for each key Efficient way

2 Answers2