How would I filter a dataframe by a column's percentile value in Scala Spark

Question

Say I have this dataframe:

  val df = Seq(("Mike",1),("Kevin",2),("Bob",3),("Steve",4)).toDF("name","score")

and I want to filter this dataframe so that it only returns rows where the "score" column is greater than or equal to the 75th percentile. How would I do this?

Thanks so much and have a great day!

Does this answer your question? [How to find median and quantiles using Spark](https://stackoverflow.com/questions/31432843/how-to-find-median-and-quantiles-using-spark) — jrook, Nov 02 '20 at 18:36
Thanks, but how would I add this as a new column to the dataframe with the corresponding percentiles? — koh-ding, Nov 02 '20 at 18:42

score 1 · Answer 1 · answered Nov 02 '20 at 20:44

What you want to base your filter on is the upper quartile, also known as the third quartile or the 75th percentile: 75% of the data lies below this point.

Based on the answer here, you can use Spark's approxQuantile to get what you want:

  val q = df.stat.approxQuantile("score", Array(.75), 0)
  // q: Array[Double] = Array(3.0)

The array q holds a single value, the boundary between the third and fourth quarters of the data.
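For reference, the three arguments to approxQuantile are the column name, an array of probabilities, and a relative error. A relative error of 0 asks for exact quantiles, which can be expensive on large data; a small positive value trades accuracy for speed. Here is a minimal sketch written as a script, assuming a local session (the master setting and app name are placeholders; in spark-shell, getOrCreate should simply return the existing session):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("quantile-sketch") // placeholder name
  .getOrCreate()
import spark.implicits._

val df = Seq(("Mike", 1), ("Kevin", 2), ("Bob", 3), ("Steve", 4)).toDF("name", "score")

// approxQuantile(column, probabilities, relativeError)
// relativeError = 0.0 requests exact quantiles (costly on large data).
val exact = df.stat.approxQuantile("score", Array(0.25, 0.5, 0.75), 0.0)

// A small positive relativeError allows an approximate, cheaper answer.
val approx = df.stat.approxQuantile("score", Array(0.75), 0.01)

println(exact.mkString(", "))
println(approx.mkString(", "))
```

One value comes back per requested probability, in the same order as the probabilities array.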

A simple Spark filter on that value should then return the rows you want:

  df.filter($"score" >= q.head).show
  +-----+-----+
  | name|score|
  +-----+-----+
  |  Bob|    3|
  |Steve|    4|
  +-----+-----+
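On the follow-up in the comments about attaching a percentile to each row as a new column: one possible approach, which is not part of the answer above, is the percent_rank window function. This is only a sketch under a few assumptions. The column name pct_rank and the app name are made up for illustration. percent_rank is computed as (rank - 1) / (n - 1), so filtering on it is not equivalent to comparing against the approxQuantile threshold. A window without partitionBy also moves all rows into a single partition, which is fine for small data but may not scale.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.percent_rank

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("percent-rank-sketch") // placeholder name
  .getOrCreate()
import spark.implicits._

val df = Seq(("Mike", 1), ("Kevin", 2), ("Bob", 3), ("Steve", 4)).toDF("name", "score")

// Order all rows by score; with no partitionBy the whole dataset is one window.
val byScore = Window.orderBy($"score")

// pct_rank (hypothetical column name) is each row's relative rank in [0, 1].
val ranked = df.withColumn("pct_rank", percent_rank().over(byScore))

ranked.show()
```

The lowest score gets a pct_rank of 0.0 and the highest gets 1.0; rows can then be filtered on that column like any other.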

@koh-ding : if this answer solves your problem, please accept it. If you need something more, please let me know so I can improve the answer. — jrook, Nov 04 '20 at 17:28
