I wonder if there is a more efficient way in Spark to find the most frequent value of a set of columns than using rank(), in order to use it as an imputation for missing values.
E.g. in Spark SQL I could formulate something similar to how to select the most frequently appearing values? per column. That solution works for a single column using rank. What I am looking for is a) a more efficient variant (as the first answer there outlines) and b) something more optimal than applying the solution from a) in a for loop over multiple columns.
Do you see any possibility to optimize this in Spark?
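For reference, the rank-based single-column variant I have in mind looks roughly like this (just a sketch; df and the column name are placeholders, and it assumes spark.implicits._ and org.apache.spark.sql.functions._ are in scope):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// count the occurrences of each value of one column, rank by frequency,
// and keep the top-ranked value -- this is what I currently repeat per column
val counts = df.groupBy("bar").count()
val byCount = Window.orderBy(desc("count"))
val mostFrequentBar = counts
  .withColumn("rnk", rank().over(byCount))
  .where($"rnk" === 1)
  .select("bar")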
Edit
An example. Here is a small Dataset:
import org.apache.spark.sql.functions._
import spark.implicits._

case class FooBarGG(foo: Int, bar: String, baz: String, dropme: String)

val df = Seq((0, "first", "A", "dropme"), (1, "second", "A", "dropme2"),
  (0, "first", "B", "foo"),
  (1, "first", "C", "foo"))
  .toDF("foo", "bar", "baz", "dropme").as[FooBarGG]

val columnsFactor = Seq("bar", "baz")
val columnsToDrop = Seq("dropme")
val factorCol = (columnsFactor ++ columnsToDrop).map(c => col(c))
With the query from the answer:
df.groupBy(factorCol: _*).count.agg(max(struct($"count" +: factorCol: _*)).alias("mostFrequent")).show
+--------------------+
| mostFrequent|
+--------------------+
|[1,second,A,dropme2]|
+--------------------+
|-- mostFrequent: struct (nullable = true)
| |-- count: long (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- dropme: string (nullable = true)
This is the result, but the single top-1 most frequent values per column are bar -> first, baz -> A and dropme -> foo, which differ from the returned result (that query returns the most frequent combination of the columns, not each column's individual mode).
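For comparison, a naive per-column loop (the kind of for-loop approach I would like to avoid; a sketch only, and ties would be broken arbitrarily) does yield those values on the example data:
import org.apache.spark.sql.functions.desc

// one Spark job per column: count the values and keep the most frequent one
val perColumnModes = (columnsFactor ++ columnsToDrop).map { c =>
  val top = df.groupBy(c).count().orderBy(desc("count")).head()
  c -> top.get(0)
}
// expected here (no ties in this data): bar -> first, baz -> A, dropme -> foo
The question is whether this can be expressed in a single pass, i.e. without looping over the columns.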