0

I'm working on a problem where I have imported a DB table into Apache Spark.

I converted it into a DataFrame and then called registerTempTable on it so that I can run Hive queries against it.

I'm able to perform other mathematical operations, for example:

sqlContext.sql("select avg(Amount) from Table1001").show

However I'm unable to find the median for a field called Amount. Is there any way to find the median on this DataFrame?

Kindly provide a suitable solution.

sarveshseri
Sanju Thomas

2 Answers

1

You can use DataFrameStatFunctions.approxQuantile to calculate the median:

val medianArray = yourDataFrame.stat.approxQuantile("Amount", Array(0.5), 0)

val median = medianArray(0)

Note: approxQuantile is optimized for approximate results rather than exact ones. Since we want the exact median here, we pass relativeError = 0, which can make the operation expensive on large data.
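To make the accuracy/cost trade-off a bit more concrete, here is a minimal sketch that wraps the call in a small helper and runs it with two different relativeError values. It assumes Spark 2.0 or later (where approxQuantile is available) and a local SparkSession; the object name MedianSketch, the helper median, the sample values, and the 0.01 tolerance are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; not part of Spark or of the original question.
object MedianSketch {

  // Returns the 0.5 quantile of `column`, or None if approxQuantile returns nothing.
  // relativeError = 0.0 requests the exact quantile; larger values are cheaper but approximate.
  def median(df: DataFrame, column: String, relativeError: Double): Option[Double] =
    df.stat.approxQuantile(column, Array(0.5), relativeError).headOption

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .appName("median-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the imported table.
    val df = Seq(10.0, 20.0, 30.0, 40.0, 50.0).toDF("Amount")

    val exact  = median(df, "Amount", 0.0)   // exact, potentially expensive
    val approx = median(df, "Amount", 0.01)  // allows a small rank error

    println(s"exact: $exact, approximate: $approx")
    spark.stop()
  }
}
```

On a table of realistic size, it may be worth trying a small non-zero relativeError first and only falling back to 0.0 if the exact value is really required.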

sarveshseri
0

To get the median, you can use the Hive UDAF percentile if you have a HiveContext (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)):

sqlContext.sql("select percentile(Amount, 0.5) from Table1001").show

If performance is an issue, you can also use percentile_approx, which returns an approximate result.
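The percentile_approx variant could look roughly like the sketch below. It assumes Spark 2.1 or later, where I believe percentile_approx is available as a built-in SQL function; on older versions a HiveContext would be needed, as described above. The object name PercentileApproxSketch, the helper approxMedian, and the sample rows are hypothetical; only the table name Table1001 and the column Amount come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the data below is invented for illustration.
object PercentileApproxSketch {

  // Runs the approximate-median query against an already registered view.
  // Assumes the Amount column is of type double, so the result is read with getDouble.
  def approxMedian(spark: SparkSession, table: String): Double =
    spark.sql(s"select percentile_approx(Amount, 0.5) from $table")
      .first()
      .getDouble(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("percentile-approx-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the imported DB table from the question.
    Seq(10.0, 20.0, 30.0, 40.0, 50.0)
      .toDF("Amount")
      .createOrReplaceTempView("Table1001")

    println(s"approximate median: ${approxMedian(spark, "Table1001")}")
    spark.stop()
  }
}
```

With the sqlContext from the question, the query string itself should be usable unchanged via sqlContext.sql(...).show.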

Raphael Roth