I have a DataFrame df like this one:

df =

name  group   influence
A     1       2
B     1       3
C     1       0
A     2       5
D     2       1

For each distinct value of group, I want to extract the value of name that has the maximum value of influence.

The expected result is this one:

group  max_name   max_influence
1      B          3
2      A          5

I know how to get the max value, but I don't know how to get max_name:

df.groupBy("group").agg(max("influence").as("max_influence"))
Markus

1 Answer

There is a good alternative to groupBy with structs: window functions, which are sometimes much faster. For your example I would try the following:

import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('group)
val res = df.withColumn("max_influence", max('influence).over(w))
  .filter('influence === 'max_influence)
res.show
+----+-----+---------+-------------+
|name|group|influence|max_influence|
+----+-----+---------+-------------+
|   A|    2|        5|            5|
|   B|    1|        3|            3|
+----+-----+---------+-------------+

Now all you need to do is drop the useless column and rename the remaining ones. Hope it helps.
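For completeness, that last step could look like this; a minimal sketch that assumes the column names from your expected output (`max_name`, `max_influence`) and the `res` DataFrame from above:

```scala
// Keep only the wanted columns, renaming them to match the expected result.
// 'influence equals 'max_influence after the filter, so either column works here.
val result = res.select(
  'group,
  'name.as("max_name"),
  'influence.as("max_influence")
)
result.show
```

Note that if two rows in a group tie for the maximum influence, this keeps both of them; if you need exactly one row per group, deduplicate afterwards (e.g. with `dropDuplicates("group")`).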