13

Given the following DataFrame:

+----+-----+---+-----+
| uid|    k|  v|count|
+----+-----+---+-----+
|   a|pref1|  b|  168|
|   a|pref3|  h|  168|
|   a|pref3|  t|   63|
|   a|pref3|  k|   84|
|   a|pref1|  e|   84|
|   a|pref2|  z|  105|
+----+-----+---+-----+

How can I get the max count for each (uid, k) group, but still include the corresponding v?

+----+-----+---+----------+
| uid|    k|  v|max(count)|
+----+-----+---+----------+
|   a|pref1|  b|       168|
|   a|pref3|  h|       168|
|   a|pref2|  z|       105|
+----+-----+---+----------+

I can do something like this, but it drops the column "v":

df.groupBy("uid", "k").max("count")
zero323
jfgosselin

3 Answers

15

It's a perfect example for window operators (using the over function) or for a join.

Since you've already figured out how to use windows, I'll focus on the join exclusively.

scala> val inventory = Seq(
     |   ("a", "pref1", "b", 168),
     |   ("a", "pref3", "h", 168),
     |   ("a", "pref3", "t",  63)).toDF("uid", "k", "v", "count")
inventory: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 2 more fields]

scala> val maxCount = inventory.groupBy("uid", "k").max("count")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+----------+
|uid|    k|max(count)|
+---+-----+----------+
|  a|pref3|       168|
|  a|pref1|       168|
+---+-----+----------+

scala> val maxCount = inventory.groupBy("uid", "k").agg(max("count") as "max")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+---+
|uid|    k|max|
+---+-----+---+
|  a|pref3|168|
|  a|pref1|168|
+---+-----+---+

scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").show
+---+-----+---+---+-----+
|uid|    k|max|  v|count|
+---+-----+---+---+-----+
|  a|pref3|168|  h|  168|
|  a|pref1|168|  b|  168|
+---+-----+---+---+-----+
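
To match the exact header asked for in the question (max(count) rather than the helper max), one more select can reorder and rename the columns. A minimal sketch, assuming the same maxCount and inventory values as above and spark.implicits._ in scope for the $ syntax:

maxCount.join(inventory, Seq("uid", "k"))
  .where($"max" === $"count")
  .select($"uid", $"k", $"v", $"count" as "max(count)")  // rename count to the requested header
  .show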
Jacek Laskowski
12

Here's the best solution I came up with so far:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

val w = Window.partitionBy("uid", "k").orderBy(col("count").desc)

df.withColumn("rank", dense_rank().over(w)).where("rank == 1").select("uid", "k", "v", "count").show
jfgosselin
11

You can use window functions:

from pyspark.sql.functions import max as max_
from pyspark.sql.window import Window

w = Window.partitionBy("uid", "k")

df.withColumn("max_count", max_("count").over(w))
1d210d2d0