Given the following DataFrame
in Spark
+-----+------+---------+----+---------+----+----+------+ │
|empno| ename| job| mgr| hiredate| sal|comm|deptno| │
+-----+------+---------+----+---------+----+----+------+ │
| 7369| SMITH| CLERK|7902|17-Dec-80| 800| 20| 10| │
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300| 30| │
| 7521| WARD| SALESMAN|7698|22-Feb-81|1250| 500| 30| │
| 7566| JONES| MANAGER|7839| 2-Apr-81|2975| 0| 20| │
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400| 30| │
| 7698| BLAKE| MANAGER|7839| 1-May-81|2850| 0| 30| │
| 7782| CLARK| MANAGER|7839| 9-Jun-81|2450| 0| 10| │
| 7788| SCOTT| ANALYST|7566|19-Apr-87|3000| 0| 20| │
| 7839| KING|PRESIDENT| 0|17-Nov-81|5000| 0| 10| │
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500| 0| 30| │
| 7876| ADAMS| CLERK|7788|23-May-87|1100| 0| 20| │
+-----+------+---------+----+---------+----+----+------+
I would like to create a new column mvp
which is true
if the row is the employee with the highest salary (sal
) in the department (deptno
), or false
otherwise. I have attempted this using Window
as shown below
val depWin = Window.partitionBy("depno")
df.withColumn("mvp", max("sal").over(depWin))
however, this only adds the salary of the highest paid employee in the same department to each row. How can I create this column denoting the highest paid in the department?