I am using a Spark 1.6.1 standalone cluster with 6 workers (8 cores and 5 GB executor memory per node).
My dataframe contains 13 columns. I want to take the 0.5th and 99.5th percentiles of each column, so I used the percentile_approx Hive UDAF as suggested in https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62wQV68D6J87EVq6AD5-T3D0F3fHjuzs+1C5aCHOUUQS8w@mail.gmail.com%3E . I am trying to collect the percentile values of the 13 columns into a dictionary. The collect operation shows only 1 task, and that task stayed idle for so long that I eventually killed the job.
PYSPARK CODE:
query = ''
for col in mergedKpis.columns[1:]:
    query = query + "percentile_approx(" + col + ",array(0.005,0.995)) as " + col + ","

percentile_dict = sqlContext.sql("select " + query.strip(',') + " from input_table") \
    .rdd.map(lambda j: j.asDict()).collect()
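For reference, printing the generated query shows the shape of what gets sent to sqlContext.sql (a sketch of the output; kpi1 and kpi2 are made-up stand-ins for my real column names):

print("select " + query.strip(',') + " from input_table")
# select percentile_approx(kpi1,array(0.005,0.995)) as kpi1,percentile_approx(kpi2,array(0.005,0.995)) as kpi2 from input_table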
SCALA CODE:
var query = ""
for (col <- mergedKpis.columns.tail) {
  query = query + ",percentile_approx(" + col + ",array(0.005))"
}
sqlContext.sql("select " + query.replaceFirst(",", "") + " from input_table").collect()
The Scala code shows the same behavior in the Spark UI as the Python code.
I tried running the same query for a single column on a 15 MB file and it took 6 seconds, and the time increases non-linearly with the file size.
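For the timing I measured roughly like this (a sketch; single_col_table and col1 are made-up names for the 15 MB file registered as a temp table and its column):

import time
start = time.time()
sqlContext.sql("select percentile_approx(col1,array(0.005,0.995)) from single_col_table").collect()
print("elapsed: %.1f s" % (time.time() - start))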
I found a function that computes the percentile of an RDD at compute percentile, but I cannot convert each of the 13 columns into an RDD and call computePercentile() on it.
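To be concrete about what that per-column conversion would mean, here is a minimal sketch (exact_percentile is my own helper in the spirit of computePercentile, not the function from that link; it assumes the column is numeric and non-null):

def exact_percentile(df, col_name, p):
    # Pull one column out as an RDD of floats; doing this for all
    # 13 columns means 13 separate sort-based jobs.
    values = df.select(col_name).rdd.map(lambda row: float(row[0]))
    # Sort and index the values so the p-th percentile can be
    # looked up by position.
    indexed = values.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    n = indexed.count()
    return indexed.lookup(int(round(p * (n - 1))))[0]

# e.g. exact_percentile(mergedKpis, mergedKpis.columns[1], 0.995)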
Can anyone tell me how to solve this issue?