
Suppose we have a PySpark DataFrame consisting of three columns: Department, Employee ID, and Salary. Each department has several employees (with associated Employee IDs), and each employee has a unique salary.

We would like to groupBy "Department" and then find the median salary of each department using some aggregation function. The problem is that no median function is provided among the built-in SQL functions, so we need to implement one ourselves.

One idea is to use the aggregation function collect_list() to collect all salaries in a group (department) and then write a UDF that finds the median of a list. However, this will be very slow. Is there a better option in terms of calculation speed? I do not have much experience with Scala, so everything should be in Python.
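For illustration, the median-of-a-list logic such a UDF would need can be sketched in plain Python. This is a minimal sketch with invented toy data; the grouping here uses an ordinary dict in place of Spark's collect_list(), just to show what the aggregation should produce per department:

```python
from collections import defaultdict
from statistics import median

def median_of_salaries(salaries):
    """Return the median of a list of salaries (what the UDF would compute)."""
    return median(salaries)

# Toy rows mirroring the (Department, Employee ID, Salary) layout.
rows = [
    ("Sales", 1, 40000.0),
    ("Sales", 2, 50000.0),
    ("Sales", 3, 60000.0),
    ("IT",    4, 70000.0),
    ("IT",    5, 90000.0),
]

# Group salaries by department, as collect_list() would do within each group.
by_dept = defaultdict(list)
for dept, _emp_id, salary in rows:
    by_dept[dept].append(salary)

medians = {dept: median_of_salaries(vals) for dept, vals in by_dept.items()}
print(medians)  # → {'Sales': 50000.0, 'IT': 80000.0}
```

The slowness the question anticipates comes from materializing every group's full salary list on a single executor before the UDF runs, not from the median computation itself.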

Thanks

  • You can use `df.approxQuantile("Salary", [0.5], 0.25)` after aggregating by department: https://stackoverflow.com/questions/31432843/how-to-find-median-and-quantiles-using-spark – Alex Apr 16 '18 at 07:46
  • ...or if you want to use Hive SQL you can use: `sqlContext.sql("SELECT percentile(Salary, 0.5) FROM df")` – Alex Apr 16 '18 at 07:53
  • And possible duplicate of [How to find the median in Apache Spark with Python Dataframe API?](https://stackoverflow.com/q/38743476/6910411) – zero323 Apr 16 '18 at 09:44
  • If the duplicates are not helping, please post an example of the data – user3689574 Apr 16 '18 at 10:06

0 Answers