Is there a pre-built outlier detection algorithm or interquartile range (IQR) method available in Spark 2.0.0? I found some code here, but I don't think it is available yet in Spark 2.0.0.

Thanks

mjimcua
Sudhakar Chavan
  • I found the URL below for computing the IQR. I will probably have to write some more UDFs to identify the outliers from the IQR and then remove or replace the outlier values. However, if a ready-made outlier detection algorithm is available, please point me to it. => http://stackoverflow.com/questions/37032689/scala-first-quartile-third-quartile-and-iqr-from-spark-sqlcontext-dataframe – Sudhakar Chavan Oct 08 '16 at 07:34

1 Answer

If you don't find a pre-built method, you can do something like this:

Example: outlier detection using the box-and-whisker (1.5 × IQR) rule:

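import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumes an existing SparkSession named sparkSession.
val sampleData = List(10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
                      14.7, 14.7, 14.9, 15.1, 15.9, 16.4)

// Build a single-column DataFrame from the sample values.
val rowRDD = sparkSession.sparkContext.makeRDD(sampleData.map(value => Row(value)))
val schema = StructType(Array(StructField("value", DoubleType)))
val df = sparkSession.createDataFrame(rowRDD, schema)

// A relative error of 0.0 requests exact quantiles (more expensive on large data).
val quantiles = df.stat.approxQuantile("value", Array(0.25, 0.75), 0.0)
val Q1 = quantiles(0)
val Q3 = quantiles(1)
val IQR = Q3 - Q1

// Values beyond 1.5 * IQR from the quartiles are treated as outliers.
val lowerRange = Q1 - 1.5 * IQR
val upperRange = Q3 + 1.5 * IQR

val outliers = df.filter(s"value < $lowerRange or value > $upperRange")
outliers.show()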

solution source:

Outlier Detection using Quantiles
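
Once the bounds are known, removing or replacing the outliers (as mentioned in the comment on the question) does not need a UDF. The following is only a rough sketch, not a built-in Spark feature: the object `OutlierHandling` and the helpers `iqrFences`, `removeOutliers` and `capOutliers` are hypothetical names made up for illustration. It assumes a non-empty numeric column and uses only the standard `approxQuantile`, `filter`, `between`, `when`/`otherwise` and `withColumn` calls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper object; none of these names exist in Spark itself.
object OutlierHandling {

  // Returns the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR for a column.
  // Assumes the column is numeric and the DataFrame is not empty.
  def iqrFences(df: DataFrame, column: String, k: Double = 1.5): (Double, Double) = {
    val Array(q1, q3) = df.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
    val iqr = q3 - q1
    (q1 - k * iqr, q3 + k * iqr)
  }

  // Option 1: drop rows whose value lies outside the bounds.
  // Rows with a null value are dropped as well, because the predicate is null for them.
  def removeOutliers(df: DataFrame, column: String): DataFrame = {
    val (lower, upper) = iqrFences(df, column)
    df.filter(col(column).between(lower, upper))
  }

  // Option 2: keep every row but clamp out-of-range values to the nearest bound.
  def capOutliers(df: DataFrame, column: String): DataFrame = {
    val (lower, upper) = iqrFences(df, column)
    df.withColumn(column,
      when(col(column) < lower, lit(lower))
        .when(col(column) > upper, lit(upper))
        .otherwise(col(column)))
  }
}
```

With the `df` from the answer, `OutlierHandling.removeOutliers(df, "value").show()` would print the rows that stay inside the bounds, and `OutlierHandling.capOutliers(df, "value").show()` would print all rows with the extreme values clamped. Whether dropping or clamping is appropriate depends on the data.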

mjimcua