Finding and marking maximum row in a Spark SQL Window

Question

Given the following DataFrame in Spark

+-----+------+---------+----+---------+----+----+------+                                         │
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|                                         │
+-----+------+---------+----+---------+----+----+------+                                         │
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|                                         │
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|                                         │
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|                                         │
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|                                         │
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|                                         │
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|                                         │
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|                                         │
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|                                         │
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|                                         │
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|                                         │
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|                                         │
+-----+------+---------+----+---------+----+----+------+

I would like to create a new column mvp which is true if the row is the employee with the highest salary (sal) in the department (deptno), or false otherwise. I have attempted this using Window as shown below

val depWin = Window.partitionBy("depno")
df.withColumn("mvp", max("sal").over(depWin))

however, this only adds the salary of the highest paid employee in the same department to each row. How can I create this column denoting the highest paid in the department?

score 1 · Answer 1 · answered Jul 17 '18 at 22:57

1

You can do this with orderBy on your Window and row_number

val depWin = Window.partitionBy("deptno").orderBy($"sal".desc)
val ranked = df.withColumn("rank", row_number.over(depWin))
ranked.withColumn("mvp", ranked("rank") === 1).drop("rank")

answered Jul 17 '18 at 22:57

Jon Deaton

3,943
6
28
41

Whoever down-voted this - please explain why. This solution surely solved my problem and gave the desired result. The downvote suggests that there is a simpler/better was to accomplish this task and I would love to know what that it. Thank you. – Jon Deaton Jul 20 '18 at 19:26

Finding and marking maximum row in a Spark SQL Window

1 Answers1