Finding the median value of a DataFrame column in Apache Spark

Question

I'm working on a problem where I have imported a DB table into Apache Spark.

I have converted it into a DataFrame and then called registerTempTable on it so that I can run Hive queries against it.

I'm able to perform other mathematical operations, such as:

sqlContext.sql("select avg(Amount) from Table1001").show

However, I'm unable to find the median of a field called Amount. Is there any way to find the median on this DataFrame?

Kindly provide a suitable solution.

How do you find the `median`? Step 1: sort; Step 2: pick the middle element. – sarveshseri, Dec 27 '17 at 07:45
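The sort-and-pick idea from the comment above can be sketched in plain Scala on a local collection. This is only an illustration, not Spark code: the `median` helper is hypothetical, and averaging the two middle values for an even count is one common convention that the comment itself does not specify.

```scala
// Exact median of a local collection: sort, then take the middle element
// (or the mean of the two middle elements when the count is even).
// Returns None for an empty input.
def median(values: Seq[Double]): Option[Double] = {
  if (values.isEmpty) None
  else {
    val sorted = values.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) Some(sorted(mid))
    else Some((sorted(mid - 1) + sorted(mid)) / 2.0)
  }
}
```

On a distributed DataFrame a full sort is costly, which is why the answers below rely on built-in quantile functions instead.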

score 1 · Answer 1 · answered Dec 27 '17 at 07:58


You can use DataFrameStatFunctions.approxQuantile to calculate the median:

val medianArray = yourDataFrame.stat.approxQuantile("Amount", Array(0.5), 0)

val median = medianArray(0)

Note: approxQuantile is optimized for approximate results rather than exact ones. Because we want the exact median here, relativeError is set to 0, which can make the operation expensive on large data.
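To illustrate the trade-off described in the note, here is a minimal sketch that runs the same call with and without a relative error. It assumes Spark 2.0 or later with a local SparkSession; the `medianOf` helper, the object name, and the sample values are made up for illustration and are not part of the question's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ApproxQuantileSketch {
  // Hypothetical helper: median of a numeric column via approxQuantile.
  // relativeError = 0.0 requests the exact quantile; larger values
  // allow Spark to trade accuracy for speed.
  // Assumes the column has at least one non-null value.
  def medianOf(df: DataFrame, column: String, relativeError: Double): Double =
    df.stat.approxQuantile(column, Array(0.5), relativeError).head

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("approx-quantile-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the Amount column.
    val df = Seq(10.0, 20.0, 30.0, 40.0, 50.0).toDF("Amount")

    println(s"exact median:  ${medianOf(df, "Amount", 0.0)}")
    println(s"approx median: ${medianOf(df, "Amount", 0.05)}")

    spark.stop()
  }
}
```

On a table this small the two results will usually coincide; the difference in cost only shows up on large inputs.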

answered Dec 27 '17 at 07:58

sarveshseri


I received the error message: error: value approxQuantile is not a member of org.apache.spark.sql.DataFrameStatFunctions. Do we need to import any package? – Sanju Thomas Dec 27 '17 at 08:10
Which Spark version are you using? approxQuantile was added in Spark 2.0. – sarveshseri Dec 27 '17 at 10:38

score 0 · Answer 2 · answered Dec 27 '17 at 11:04

To get the median, you can use the Hive UDAF percentile if you have a HiveContext (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)):

sqlContext.sql("select percentile(Amount, 0.5) from Table1001").show

If performance is an issue, you can also use percentile_approx, which returns an approximate result.
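As a rough side-by-side of the two functions, the sketch below registers a small temp view and queries both. It assumes Spark 2.1 or later, where percentile and percentile_approx are available in Spark SQL through a plain SparkSession; on Spark 1.x a HiveContext is needed, as stated above. The object name, the `exactMedian` helper, the sample values, and the accuracy argument of 1000 are illustrative choices only.

```scala
import org.apache.spark.sql.SparkSession

object PercentileSqlSketch {
  // Hypothetical helper: exact median of a column through the SQL percentile function.
  def exactMedian(spark: SparkSession, table: String, column: String): Double =
    spark.sql(s"select percentile($column, 0.5) from $table").first().getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("percentile-sql-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data registered under the table name used in the question.
    Seq(10.0, 20.0, 30.0, 40.0, 50.0).toDF("Amount")
      .createOrReplaceTempView("Table1001")

    // percentile computes the exact value; percentile_approx takes an
    // optional third argument that controls the approximation accuracy.
    spark.sql(
      """select percentile(Amount, 0.5)              as exact_median,
        |       percentile_approx(Amount, 0.5, 1000) as approx_median
        |from Table1001""".stripMargin
    ).show()

    println(s"exact median via helper: ${exactMedian(spark, "Table1001", "Amount")}")

    spark.stop()
  }
}
```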
