Say I have this dataframe:

  val df = Seq(("Mike",1),("Kevin",2),("Bob",3),("Steve",4)).toDF("name","score")

and I want to filter this dataframe so that it only returns rows where the "score" column is greater than or equal to the 75th percentile. How would I do this?

Thanks so much and have a great day!

koh-ding
  • Does this answer your question? [How to find median and quantiles using Spark](https://stackoverflow.com/questions/31432843/how-to-find-median-and-quantiles-using-spark) – jrook Nov 02 '20 at 18:36
  • Thanks, but how would I add this as a new column to the dataframe with the corresponding percentiles? – koh-ding Nov 02 '20 at 18:42

1 Answer

What you want to base your filter on is the upper quartile.

It is also known as the third quartile (Q3) or the 0.75 empirical quantile, and 75% of the data lies at or below this point.

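Based on the answer here, you can use Spark's approxQuantile to get what you want. The last argument is the relative error; passing 0 asks for the exact quantile, which can be expensive on large data: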

val q = df.stat.approxQuantile("score", Array(.75), 0)
q: Array[Double] = Array(3.0)

The array q holds a single value: the boundary between the third and fourth quarters of the data, i.e. the 75th percentile.
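To see why the result is 3.0 for four scores, here is a plain-Scala sketch of a nearest-rank quantile. This is an illustration only, not Spark's algorithm: as the Spark documentation describes it, approxQuantile returns an element of the data whose rank is close to p * N, and with a relative error of 0 that coincides with the nearest-rank value for this dataset. The function name nearestRankQuantile is hypothetical.

```scala
// Nearest-rank quantile over an in-memory collection.
// Illustration only: this is NOT how Spark computes approxQuantile internally.
def nearestRankQuantile(values: Seq[Double], p: Double): Double = {
  require(values.nonEmpty, "values must not be empty")
  require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
  val sorted = values.sorted
  // 1-based rank ceil(p * N), clamped to at least 1 so that p = 0 returns the minimum
  val rank = math.max(1, math.ceil(p * sorted.size).toInt)
  sorted(rank - 1)
}

val scores = Seq(1.0, 2.0, 3.0, 4.0)
val q75 = nearestRankQuantile(scores, 0.75) // 3.0, because ceil(0.75 * 4) = 3
```

With only four rows, the third-smallest score is the 75th percentile, so Bob's score (3) is itself included by a >= filter.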

A simple Spark filter on that value then returns the rows you want:

df.filter($"score" >= q.head).show
+-----+-----+
| name|score|
+-----+-----+
|  Bob|    3|
|Steve|    4|
+-----+-----+
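For use outside spark-shell, the two steps can be put together roughly as follows. This is a minimal sketch under a few assumptions: the object name UpperQuartileFilter, the helper atOrAboveQuantile, the app name, and the local[*] master are all made up for illustration, and in spark-shell the session and the implicits import already exist.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UpperQuartileFilter {

  // Hypothetical helper: keep rows whose `column` value is at or above the p-quantile.
  def atOrAboveQuantile(df: DataFrame, column: String, p: Double): DataFrame = {
    // relativeError = 0.0 requests the exact quantile (can be expensive on large data)
    val quantiles = df.stat.approxQuantile(column, Array(p), 0.0)
    quantiles.headOption match {
      case Some(threshold) => df.filter(col(column) >= threshold)
      case None            => df // defensive: no quantile came back, nothing to filter on
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation
    val spark = SparkSession.builder()
      .appName("upper-quartile-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    val df = Seq(("Mike", 1), ("Kevin", 2), ("Bob", 3), ("Steve", 4)).toDF("name", "score")

    atOrAboveQuantile(df, "score", 0.75).show()

    spark.stop()
  }
}
```

Wrapping the threshold lookup in a helper keeps the quantile computation and the filter together, so the same call works for other columns or probabilities.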
jrook
  • @koh-ding: if this answer solves your problem, please accept it. If you need something more, please let me know so I can improve the answer. – jrook Nov 04 '20 at 17:28