approxQuantile gives incorrect median in Spark (Scala)?

Question

I have this test data:

    val data = List(
      List(47.5335D),
      List(67.5335D),
      List(69.5335D),
      List(444.1235D),
      List(677.5335D)
    )

I'm expecting the median to be 69.5335. But when I try to find the exact median with this code:

df.stat.approxQuantile(column, Array(0.5), 0)

It gives me: 444.1235

Why is this so, and how can it be fixed?

I'm doing it like this:

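    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

    val data = List(
      List(47.5335D),
      List(67.5335D),
      List(69.5335D),
      List(444.1235D),
      List(677.5335D)
    )

    // Build a single-column DataFrame of non-nullable doubles
    val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
    val schema = StructType(Array(
      StructField("value", DataTypes.DoubleType, false)
    ))

    val df = sqlContext.createDataFrame(rdd, schema)
    df.createOrReplaceTempView(tableName)

    // `sc` must be a SparkSession or SQLContext here (SparkContext has no `sql` method)
    val df2 = sc.sql(s"SELECT value FROM $tableName")
    val median = df2.stat.approxQuantile("value", Array(0.5), 0)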

So I'm creating a temp table, querying it, and then calculating the result. It's just for testing.

I'm having the same problem. Any suggestions on how to solve this? – Nima Mousavi, Jul 05 '18 at 10:12
@Nimi As far as I remember, I solved it by writing my own UDF. – sergeda, Jul 06 '18 at 03:53
Would you mind sharing? I don't know how to aggregate the values of a column using a UDF. I'd like to keep the calculations in Spark and not extract the values. – Nima Mousavi, Jul 06 '18 at 05:53
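The UDF mentioned above was never posted in this thread. As a possible alternative that also keeps the work inside Spark, here is a minimal sketch that uses Spark SQL's built-in `percentile` aggregate instead of `approxQuantile`. It assumes a Spark version where `percentile` is available as a SQL function; the object name `ExactMedian`, the helper `exactMedian`, and the local session settings are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExactMedian {
  // Hypothetical helper: exact 50th percentile of a numeric column,
  // computed by Spark SQL's `percentile` aggregate (not approxQuantile).
  def exactMedian(df: DataFrame, column: String): Double =
    df.selectExpr(s"percentile(`$column`, 0.5)").first().getDouble(0)

  def main(args: Array[String]): Unit = {
    // Local session only for trying the sketch out
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("exact-median-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same five test values as in the question
    val df = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335).toDF("value")
    println(exactMedian(df, "value"))

    spark.stop()
  }
}
```

With an odd number of rows the middle position falls exactly on one value, so for the five test values this should come out as 69.5335. With an even number of rows `percentile` interpolates between the two middle values, which is not the same as picking an existing element. Because it is an exact computation, it may be noticeably slower than `approxQuantile` on large data.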

score 3 · Answer 1 · answered Mar 20 '17 at 12:59

Note that this is an approximate quantile computation. It is not guaranteed to give you the exact answer every time. See here for a more thorough explanation.

The reason is that for very large datasets, sometimes you are OK with an approximate answer, as long as you get it significantly faster than the exact computation.

Amir

But the documentation at https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameStatFunctions.html#approxQuantile(java.lang.String,%20double[],%20double) states: **relativeError - The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed** – sergeda Mar 21 '17 at 11:35
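The same API documentation also gives a rank bound for the result: with N elements, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is plain Scala (no Spark) and just evaluates that bound for the five test values. It assumes ranks are 1-based in sorted order, which is my reading of the docs; the names `RankBound` and `allowedRanks` are hypothetical.

```scala
object RankBound {
  // Inclusive range of 1-based ranks permitted by the documented bound
  //   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N),
  // clamped to the valid ranks 1..n. (1-based ranking is an assumption.)
  def allowedRanks(n: Int, p: Double, err: Double): (Int, Int) = {
    val lo = math.floor((p - err) * n).toInt
    val hi = math.ceil((p + err) * n).toInt
    (math.max(lo, 1), math.min(hi, n))
  }

  def main(args: Array[String]): Unit = {
    val values = Seq(47.5335, 67.5335, 69.5335, 444.1235, 677.5335).sorted
    val (lo, hi) = allowedRanks(values.size, p = 0.5, err = 0.0)
    // Values whose rank falls inside the permitted range
    val allowed = values.slice(lo - 1, hi)
    println(s"ranks $lo..$hi -> ${allowed.mkString(", ")}")
  }
}
```

For N = 5, p = 0.5 and err = 0 the bound works out to ranks 2 through 3, so under this reading both 67.5335 and 69.5335 are within the documented guarantee, while 444.1235 (rank 4) is not. That would make the 444.1235 result look like a genuine defect rather than approximation error, though nothing in this thread confirms the cause.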

score 0 · Answer 2 · answered Mar 08 '17 at 11:47

This is the result on my local machine. Are you doing something similar?

    val data = List(
      List(47.5335D),
      List(67.5335D),
      List(69.5335D),
      List(444.1235D),
      List(677.5335D)
    )

    val df = data.flatten.toDF

    df.stat.approxQuantile("value", Array(0.5), 0)
    // res18: Array[Double] = Array(67.5335)

Alex Karpov

Hmm, strange. A different result, but still not 69.5335. I've added the full source to my question. – sergeda Mar 08 '17 at 12:02

Jeffan · Answer 3 · 2019-08-19T04:31:59.700

I encountered a similar problem when using the approxQuantile() method with Spark 2.2.1. After I upgraded to Spark 2.4.3, approxQuantile() returned the correct exact median.

edited Aug 19 '19 at 04:31

answered Aug 19 '19 at 00:08

Jeffan
