How to find quantiles inside agg() function after groupBy in Scala SPARK

Question

I have a DataFrame that I want to group by column A and then compute several statistics: mean, min, max, standard deviation, and quantiles.

I can compute min, max, and mean with the following code: df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)

However, I cannot compute the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but I get the following error:

error: not found: value approxQuantile

I assume you are trying to take a sample of the data from the DataFrame / Dataset. Spark has a `sample(fraction: Double)` API for that. Please try that one. — Ravi, Sep 03 '19 at 07:16
Possible duplicate of [How to use approxQuantile by group?](https://stackoverflow.com/questions/53548964/how-to-use-approxquantile-by-group) — George Leung, Sep 03 '19 at 07:26

Raphael Roth · Accepted Answer · 2019-09-03T10:52:18.950

If you have Hive on the classpath, you can use many of its UDAFs, such as percentile_approx and stddev_samp; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

You can call these functions using callUDF:

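// `ss` is assumed to be an existing SparkSession
import ss.implicits._
import org.apache.spark.sql.functions.{callUDF, lit}

val df = Seq(1.0, 2.0, 3.0).toDF("x")

df.groupBy() // no grouping columns: aggregate over the whole DataFrame
  .agg(
    callUDF("percentile_approx", $"x", lit(0.5)).as("median"),
    callUDF("stddev_samp", $"x").as("stdev")
  )
  .show()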
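
The error in the question arises because approxQuantile is a method on df.stat (DataFrameStatFunctions), not a column function, so it cannot be used inside agg(). A minimal sketch of the grouped version follows, assuming columns named A and B as in the question and a Spark version where percentile_approx is available as a built-in SQL function. The object name GroupedStats, the helper summarize, and the sample data are hypothetical and only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, max, mean, min, stddev}

object GroupedStats {
  // Per-group min, max, mean, sample standard deviation and approximate
  // quartiles of column B. Column names A and B are assumptions taken
  // from the question.
  def summarize(df: DataFrame): DataFrame =
    df.groupBy("A")
      .agg(
        min("B").as("min"),
        max("B").as("max"),
        mean("B").as("mean"),
        stddev("B").as("stddev"),
        // With an array of percentages, percentile_approx returns an array
        // holding one approximate value per requested percentage.
        expr("percentile_approx(B, array(0.25, 0.5, 0.75))").as("q")
      )
      .select(
        col("A"), col("min"), col("max"), col("mean"), col("stddev"),
        col("q").getItem(0).as("q25"),
        col("q").getItem(1).as("q50"),
        col("q").getItem(2).as("q75")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-stats")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val df = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("a", 4.0), ("b", 10.0), ("b", 20.0))
      .toDF("A", "B")

    summarize(df).show(false)
    spark.stop()
  }
}
```

The quantiles are approximate; percentile_approx takes an optional third accuracy argument that trades memory for precision.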

score -1 · Answer 2 · answered Mar 25 '22 at 17:16

Here is code that I have tested on Spark 3.1:

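// `spark` is assumed to be an existing SparkSession
import spark.implicits._
import org.apache.spark.sql.functions.{lit, percentile_approx}

val simpleData = Seq(("James","Sales","NY",90000,34,10000),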
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()


// Approximate median salary per department; the third argument is the accuracy
df.groupBy($"department")
  .agg(
    percentile_approx($"salary", lit(0.5), lit(10000))
  )
  .show(false)

Output

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+

+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales     |86000                                |
|Finance   |83000                                |
|Marketing |80000                                |
+----------+-------------------------------------+
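
To get the three quantiles from the question (0.25, 0.5, 0.75) in one pass, the same function can be given an array of percentages. This is a minimal sketch, assuming Spark 3.1 or later (where percentile_approx is exposed in org.apache.spark.sql.functions) and the department and salary columns from the example above. The object name SalaryQuartiles, the helper quartilesByDepartment, and the shortened sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, percentile_approx}

object SalaryQuartiles {
  // Approximate 25th, 50th and 75th percentile of salary per department,
  // returned as a single array column named "quartiles".
  def quartilesByDepartment(df: DataFrame): DataFrame = {
    // The percentages must be constants; an array of literals qualifies.
    val percentages = array(lit(0.25), lit(0.5), lit(0.75))
    df.groupBy(col("department"))
      .agg(percentile_approx(col("salary"), percentages, lit(10000)).as("quartiles"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("salary-quartiles")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical subset of the sample data used above
    val df = Seq(
      ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
      ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
      ("Marketing", 80000), ("Marketing", 91000)
    ).toDF("department", "salary")

    quartilesByDepartment(df).show(false)
    spark.stop()
  }
}
```

Individual quantiles can then be pulled out of the array with getItem(0), getItem(1), and getItem(2) if separate columns are preferred.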

percentile_approx is not a function. The signature is "approx_percentile" — Dylan, Apr 04 '23 at 18:22
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.percentile_approx.html — Xavier John, Apr 04 '23 at 21:28
You are correct, I apologize for the mistake. I was reading too quickly. — Dylan, Apr 05 '23 at 22:24
