SQL percentile on DataFrame with float numbers Spark 1.6 - any possible workaround?

Question

I am trying to find a way to calculate the 0.25 and 0.75 percentiles on a DataFrame column of float numbers:

  sqlContext.sql("SELECT percentile(x, 0.5) FROM df")

As far as I understand from the error I got, percentile supports only integers:

  AnalysisException: u'No handler for Hive udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile because: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (float, double). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)  .; line 1 pos 43'

So I either need to use

 sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")

or cast the column:

cast(x as bigint)

Of course, neither gives the same result that I get when I calculate the percentile with pandas on the same float values.

How can I get a percentile on float numbers in Spark 1.6?

One workaround I can think of is to multiply the column by a big number (for instance 10000000) and calculate the percentile on the result as an integer.
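
A minimal sketch of that scaling idea, assuming the column is named x and the table is registered as df, as in the queries above. The SCALE constant and the helper name scaled_percentile_query are hypothetical. The helper only builds the query string; the cast to bigint drops any digits beyond 1/SCALE, so the result can only be as precise as the chosen scale.

```python
SCALE = 10000000  # hypothetical scaling factor, taken from the example above


def scaled_percentile_query(column, table, q, scale=SCALE):
    """Build a query that runs percentile() on a scaled bigint column.

    The column is multiplied by `scale`, cast to bigint so that the Hive
    percentile UDAF accepts it, and the result is divided by `scale` again.
    """
    # cast(... as bigint) discards digits beyond 1/scale, which limits precision
    return (
        "SELECT percentile(cast({col} * {scale} as bigint), {q}) / {scale} "
        "FROM {table}"
    ).format(col=column, scale=scale, q=q, table=table)


query = scaled_percentile_query("x", "df", 0.25)
# With an existing sqlContext, the query could then be run as:
# sqlContext.sql(query)
```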

Are there any other possible solutions or workarounds?

Thanks!

score 0 · Accepted Answer · edited May 23 '17 at 12:16

Doing it via SQL when it's not supported is clearly a workaround, and it may take more time than simply doing it on the RDD. Sticking to the DataFrame is fine when it makes things easy, but there is no point in forcing it to do what you could easily do with an RDD.

If you want to compute a percentile on the RDD, this question explains how: How to compute percentiles in Apache Spark
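
As a rough sketch of what the RDD route could look like (not necessarily the method from the linked question): sort the values, index them, and interpolate linearly between the two nearest ranks, which is also the default interpolation of pandas' quantile. The function names interpolate_percentile and rdd_percentile are hypothetical, and the RDD is assumed to contain plain floats.

```python
import math


def interpolate_percentile(lookup, n, q):
    """Linear-interpolation percentile over n sorted values.

    `lookup(i)` must return the i-th smallest value (0-based);
    `q` is a fraction between 0 and 1.
    """
    if n == 0:
        raise ValueError("cannot compute a percentile of an empty dataset")
    pos = q * (n - 1)            # fractional rank of the requested percentile
    lo = int(math.floor(pos))
    hi = int(math.ceil(pos))
    lo_val = lookup(lo)
    if lo == hi:
        return lo_val
    hi_val = lookup(hi)
    # Weight the two neighbouring values by the distance from the lower rank
    return lo_val + (hi_val - lo_val) * (pos - lo)


def rdd_percentile(rdd, q):
    """Percentile of an RDD of floats (assumed input) via sort + index."""
    indexed = (
        rdd.sortBy(lambda x: x)
        .zipWithIndex()                    # (value, index)
        .map(lambda vi: (vi[1], vi[0]))    # (index, value), so lookup() works
        .cache()
    )
    n = indexed.count()
    return interpolate_percentile(lambda i: indexed.lookup(i)[0], n, q)
```

Assuming the column is named x as in the question, this could be called as rdd_percentile(df.rdd.map(lambda row: row.x), 0.25). The interpolation step itself can be checked without Spark by passing a sorted list's __getitem__ as the lookup.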

answered Jan 15 '17 at 10:36

Chobeat
