I am using the Robust Z-Score method to find anomalies in many columns using Spark SQL. Unfortunately, this involves calculating a lot of medians, which is very inefficient. I did some searching but can't find any built-in, efficient libraries for approximate or fast median calculation.
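For context, the robust z-score I am computing per column is (assuming the usual median/MAD definition):

    z = 0.6745 * (x - median(x)) / MAD(x),    where MAD(x) = median(|x - median(x)|)

so every column needs at least two median computations, which is why the median cost dominates the job.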
Every time I run my query, which involves the following call:

    sqlContext.sql("SELECT percentile_approx(" + colname + ", 0.5) FROM partitioned")

I end up receiving the following error:
    Name: java.lang.OutOfMemoryError
    Message: GC overhead limit exceeded
So I am assuming this method is not really usable in practice. I can post more of my code if required (it is a bit convoluted at the moment); a simplified sketch of the relevant loop is below. My dataset has at most 500k points, so do you guys think this is an issue of inefficient caching or data usage on my end, or do I need a better method of finding the median?
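In case it helps, here is roughly what the relevant part looks like (simplified; sqlContext and the "partitioned" temp table come from my actual job, and the column names below are just placeholders for the many numeric columns I loop over):

    // Placeholder names standing in for the real numeric columns
    val numericCols = Seq("col_a", "col_b", "col_c")

    // One percentile_approx query per column -- this is the loop that blows up
    val medians = numericCols.map { colname =>
      val median = sqlContext
        .sql("SELECT percentile_approx(" + colname + ", 0.5) FROM partitioned")
        .first()
        .getDouble(0)
      (colname, median)
    }.toMap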