
I am using a Robust Z-Score method to find anomalies across many columns with Spark SQL. Unfortunately, this involves calculating many medians, which is very inefficient. I did some searching but can't find any built-in, efficient library for approximate or fast median calculation.
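Roughly, the per-column computation looks like this (a simplified sketch rather than my exact code; colname is the column being checked, partitioned is my table, and 3.5 is just an example cutoff):

// Median and MAD both computed with percentile_approx; 1.4826 is the usual
// consistency constant that makes the MAD comparable to a standard deviation.
val median = sqlContext.sql(
  s"SELECT percentile_approx($colname, 0.5) FROM partitioned"
).first().getDouble(0)

val mad = sqlContext.sql(
  s"SELECT percentile_approx(abs($colname - $median), 0.5) FROM partitioned"
).first().getDouble(0)

// Flag rows whose robust z-score exceeds the cutoff
val anomalies = sqlContext.sql(
  s"SELECT * FROM partitioned WHERE abs($colname - $median) / (1.4826 * $mad) > 3.5"
)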

Every time I run my query, which involves the following call:

sqlContext.sql("SELECT percentile_approx(" + colname + ", 0.5) FROM partitioned")

I end up receiving the following error:

Name: java.lang.OutOfMemoryError
Message: GC overhead limit exceeded

So I am assuming this method is not really usable in practice. I can post portions of my code if necessary (I haven't because it is a bit convoluted at the moment, but I can if required). My dataset has at most 500k points, so do you think this is an issue of inefficient caching or data usage on my end, or do I need a better method of finding the median?

  • why don't you just sort the elements and take the size/2 th element? It is much faster and easier – GameOfThrows Jul 05 '16 at 15:41
  • @GameOfThrows Is there a good way to do this in place w/o having to define a new DataFrame? I am relatively new to Spark at the moment. – Eric Staner Jul 05 '16 at 15:50
  • If so it has never been a part of Spark. It is just a Hive code that happens to be compatible with Spark. – zero323 Jul 06 '16 at 17:25
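For reference, a minimal sketch of the exact, sort-based median suggested in the first comment, assuming the column's values have been extracted into an RDD[Double] named values (this still requires a full sort and shuffle):

val sorted = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val n = sorted.count()

// Odd count: take the middle element; even count: average the two middle elements
val exactMedian =
  if (n % 2 == 1) sorted.lookup(n / 2).head
  else (sorted.lookup(n / 2 - 1).head + sorted.lookup(n / 2).head) / 2.0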

1 Answer


If you want to use the Hive UDF as in the question, you can provide an additional argument which determines the number of records to use:

import org.apache.spark.mllib.random.RandomRDDs
import sqlContext.implicits._  // for toDF when not running in spark-shell

// 100k standard-normal samples registered as a temporary table
RandomRDDs.normalRDD(sc, 100000).map(Tuple1(_)).toDF("x").registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5, 100) FROM df").show()

// +--------------------+
// |                 _c0|
// +--------------------+
// |-0.02626781447291...|
// +--------------------+

sqlContext.sql("SELECT percentile_approx(x, 0.5, 10) FROM df").show()

// +-------------------+
// |                _c0|
// +-------------------+
// |-0.4185534605295841|
// +-------------------+

The default value is 10000, so while it is still expensive due to the related shuffle, it shouldn't lead to OOM in practice. This suggests there are other problems with your configuration or query which go beyond the median computation itself.

On a side note, Spark 2.0.0 provides a native percentile approximation method, as described in How to find median using Spark.
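For example, in Spark 2.0+ the DataFrame API exposes approxQuantile (this sketch is not part of the original answer; it assumes a SparkSession named spark and just reuses a column called "x"):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Dummy data: a DataFrame with a single double column "x"
val df = spark.range(0, 100000).selectExpr("rand() AS x")

// approxQuantile(column, probabilities, relativeError) -> approximate quantiles,
// here the median of "x" with a 1% relative error
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.01)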
