I have a DataFrame df like this one:

df =

name  group   influence
A     1       2
B     1       3
C     1       0
A     2       5
D     2       1

For each distinct value of group, I want to extract the value of name that has the maximum value of influence.

The expected result is this one:

group  max_name   max_influence
1      B          3
2      A          5

I know how to get the max value, but I don't know how to get max_name:

df.groupBy("group").agg(max("influence").as("max_influence"))
Markus

1 Answer

There is a good alternative to groupBy with structs: window functions, which are sometimes much faster. For your example I would try the following:

import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
|   A|    2|        5|            5|
|   B|    1|        3|            3|
+----+-----+---------+-------------+

Now all you need to do is drop the useless column and rename the remaining ones. Hope it helps.
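For completeness, that last step could look like this; a minimal sketch that assumes the column names from your expected output (`max_name`, `max_influence`) and the `res` DataFrame from above:

```scala
// Keep only the wanted columns, renaming them to match the expected result.
// 'influence equals 'max_influence after the filter, so either column works here.
val result = res.select(
  'group,
  'name.as("max_name"),
  'influence.as("max_influence")
)
result.show
```

Note that if two rows in a group tie for the maximum influence, this keeps both of them; if you need exactly one row per group, deduplicate afterwards (e.g. with `dropDuplicates("group")`).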