
I'm looking for an equivalent of the min_by aggregate (from Presto) in the Spark DataFrame API, or I may need to aggregate manually. Any thoughts? Thanks.

https://prestodb.io/docs/current/functions/aggregate.html#min_by
addmeaning

1 Answer


There is no direct equivalent of min_by in the DataFrame API.

Getting a per-group minimum is a two-stage operation in Spark: first group by the key column, then apply the min aggregate to get the minimum value of the numeric column for each group.

scala> val inputDF = Seq(("a", 1),("b", 2), ("b", 3), ("a", 4), ("a", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]

scala> inputDF.show()
+---+-----+
| id|count|
+---+-----+
|  a|    1|
|  b|    2|
|  b|    3|
|  a|    4|
|  a|    5|
+---+-----+

scala> inputDF.groupBy($"id").min("count").show()
+---+----------+
| id|min(count)|
+---+----------+
|  b|         2|
|  a|         1|
+---+----------+
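Note that groupBy(...).min only returns the minimum value itself. If, like Presto's min_by(x, y), you need the value of a *different* column taken from the row where the ordering column is minimal, one common workaround is to take min over a struct whose first field is the ordering column (structs compare field by field, so the minimum struct carries the companion value along). A sketch, using a hypothetical label column added to the example data:

scala> import org.apache.spark.sql.functions.{min, struct}
import org.apache.spark.sql.functions.{min, struct}

scala> val df = Seq(("a", 1, "x"), ("b", 2, "y"), ("b", 3, "z"), ("a", 4, "w")).toDF("id", "count", "label")

scala> // equivalent of min_by(label, count) grouped by id:
scala> df.groupBy($"id")
         .agg(min(struct($"count", $"label")).as("m"))
         .select($"id", $"m.label".as("min_by_label"))
         .show()

Also worth knowing: Spark 3.0+ ships min_by and max_by as built-in SQL functions, so on a recent version you can call expr("min_by(label, count)") directly instead of the struct trick.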
Lakshman Battini