Suppose we have a PySpark DataFrame consisting of three columns: Department, Employee ID, and Salary. Each department has several employees (each with an associated Employee ID). Each employee has a unique salary.
We would like to groupBy "Department" and then find the median salary of each department using an aggregation function. The problem is that, as far as I can tell, no median function is provided among the SQL functions, so we need to implement one ourselves.
One idea is to use the aggregation function collect_list() to collect all salaries in a group (Department) and then write a UDF that finds the median of that list. However, this will be very slow. Is there a better option in terms of computation speed? I do not have much experience with Scala, so everything should be in Python.
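For reference, here is a minimal sketch of the collect-then-UDF idea, assuming the columns are named exactly "Department" and "Salary" and that salaries are numeric. The helper names median_of_list and median_salary_by_department are made up for illustration, and PySpark is imported inside the second function so the plain-Python helper can be tried on its own.

```python
from statistics import median


def median_of_list(values):
    """Return the median of a list of numbers, or None if nothing is left.

    None entries are dropped first, since a collected list may contain them
    depending on how the data was prepared (an assumption, not a guarantee).
    """
    cleaned = [v for v in (values or []) if v is not None]
    if not cleaned:
        return None
    return float(median(cleaned))


def median_salary_by_department(df):
    """Hypothetical helper: one row per Department with its median Salary.

    Assumes `df` is a PySpark DataFrame with "Department" and "Salary" columns.
    """
    # Imported lazily so median_of_list can be used and tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Wrap the plain-Python function as a UDF that returns a double.
    median_udf = F.udf(median_of_list, DoubleType())

    return (
        df.groupBy("Department")
        # Gather every salary of the department into one array column.
        .agg(F.collect_list("Salary").alias("salaries"))
        # Apply the Python UDF row by row to that array.
        .withColumn("median_salary", median_udf(F.col("salaries")))
        .drop("salaries")
    )
```

The cost that worries me is that each department's full salary list is collected into a single row and then handed to a Python UDF, which is what I would like to avoid.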
Thanks