How can I calculate exact median with Apache Spark?

Question

This page contains some statistics functions (mean, stdev, variance, etc.) but it does not contain the median. How can I calculate exact median?

the percentile_approx does not work for groups with even number of entries. To make it work there, a probable workaround is to take the average of 50th percentile and the next value. In pyspark something below works.. `((F.percentile_approx('val', 0.5)+ F.percentile_approx('val', 0.500000000001))*.5).alias('med_val2') )` — prashanth, Jul 19 '23 at 12:32

Eugene Zhulenev · Accepted Answer · 2015-01-28T14:21:45.837

19

You need to sort RDD and take element in the middle or average of two elements. Here is example with RDD[Int]:

  import org.apache.spark.SparkContext._

  val rdd: RDD[Int] = ???

  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  val median: Double = if (count % 2 == 0) {
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
  } else sorted.lookup(count / 2).head.toDouble

edited Jan 28 '15 at 14:21

answered Jan 26 '15 at 23:31

Eugene Zhulenev

9,714
2
30
40

what is this "lookup" method ? AFAIK it does not exist in RDD. – WestCoastProjects Jan 28 '15 at 10:21
@javadba yeah, you need to import SparkContext._ to bring PairRDD implicits in scope – Eugene Zhulenev Jan 28 '15 at 14:21
2

p.s. I think that there are faster algorithms for finding median that don't require full sorting (http://en.wikipedia.org/wiki/Selection_algorithm) – Eran Medan May 20 '15 at 17:45
unfortunately they are not applicable to distributed RDD – Eugene Zhulenev Oct 28 '15 at 20:57
1

Can DataFrame API be used instead of RDD API? – Geoffrey Anderson Jul 08 '16 at 16:31
Yes, see http://stackoverflow.com/questions/31432843/how-to-find-median-using-spark – FrankGT Feb 21 '17 at 14:35
@EugeneZhulenev It would be better to persist the sorted RDD so it that it won't recompute the DAG while doing lookups – kjsr7 Apr 14 '22 at 10:03

Shaido · Answer 2 · 2020-03-05T12:20:51.637

7

Using Spark 2.0+ and the DataFrame API you can use the approxQuantile method：

def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)

It will also work on multiple columns at the same time since Spark version 2.2. By setting probabilites to Array(0.5) and relativeError to 0, it will compute the exact median. From the documentation:

The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.

Despite this, there seems to be some issues with the precision when setting relativeError to 0, see the question here. A low error close to 0 will in some instances work better (will depend on Spark version).

A small working example which calculates the median of the numbers from 1 to 99 (both inclusive) and uses a low relativeError:

val df = (1 to 99).toDF("num")
val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0)
println(median)

The median returned is 50.0.

edited Mar 05 '20 at 12:20

answered Dec 14 '17 at 03:44

Shaido

27,497
23
70
73

Monica, do you know why when I run your code, I get NameError: name 'Array' is not defined? this does not seem like it's a package i need to import – bernando_vialli Jan 02 '20 at 21:34
@mathlover: Are you using Scala? Maybe you overwrite the name somewhere with a variable? – Shaido Jan 03 '20 at 01:13
I am using PySpark – bernando_vialli Jan 03 '20 at 13:34
@mathlover: Then it's not surprising you can't use a Scala version straight off. You need to adapt it a bit. – Shaido Jan 03 '20 at 13:47

How can I calculate exact median with Apache Spark?

2 Answers2

Linked