I am using a Spark 1.6.1 standalone cluster with 6 workers (8 cores and 5 GB executor memory per node).
My dataframe contains 13 columns. I want to take the 0.5th and 99.5th percentiles of each column, so I used the percentile_approx Hive UDAF as suggested in https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62wQV68D6J87EVq6AD5-T3D0F3fHjuzs+1C5aCHOUUQS8w@mail.gmail.com%3E . I am trying to collect the percentile values of the 13 columns into a dictionary. The collect operation shows only 1 task, and that task stayed idle for so long that I eventually killed the job.
PYSPARK CODE:
query = ''
for col in mergedKpis.columns[1:]:
    query = query + "percentile_approx(" + col + ",array(0.005,0.995)) as " + col + ","

percentile_dict = sqlContext.sql("select " + query.strip(',') + " from input_table") \
    .rdd.map(lambda j: j.asDict()).collect()
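For reference, printing the generated query shows the shape of what gets sent to sqlContext.sql (a sketch of the output; kpi1 and kpi2 are made-up stand-ins for my real column names):

print("select " + query.strip(',') + " from input_table")
# select percentile_approx(kpi1,array(0.005,0.995)) as kpi1,percentile_approx(kpi2,array(0.005,0.995)) as kpi2 from input_table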
SCALA CODE:
var query = ""
for (col <- mergedKpis.columns.tail) {
  query = query + ",percentile_approx(" + col + ",array(0.005))"
}
sqlContext.sql("select " + query.replaceFirst(",", "") + " from input_table").collect()
The Scala code shows the same behavior in the Spark UI as the Python code.
I tried running the same query for a single column on a 15 MB file and it took 6 seconds, and the time increases non-linearly with the file size.
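For the timing I measured roughly like this (a sketch; single_col_table and col1 are made-up names for the 15 MB file registered as a temp table and its column):

import time
start = time.time()
sqlContext.sql("select percentile_approx(col1,array(0.005,0.995)) from single_col_table").collect()
print("elapsed: %.1f s" % (time.time() - start))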
I found a function that computes the percentile of an RDD at compute percentile, but I cannot convert each of the 13 columns into an RDD and call computePercentile() on it.
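To be concrete about what that per-column conversion would mean, here is a minimal sketch (exact_percentile is my own helper in the spirit of computePercentile, not the function from that link; it assumes the column is numeric and non-null):

def exact_percentile(df, col_name, p):
    # Pull one column out as an RDD of floats; doing this for all
    # 13 columns means 13 separate sort-based jobs.
    values = df.select(col_name).rdd.map(lambda row: float(row[0]))
    # Sort and index the values so the p-th percentile can be
    # looked up by position.
    indexed = values.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0]))
    n = indexed.count()
    return indexed.lookup(int(round(p * (n - 1))))[0]

# e.g. exact_percentile(mergedKpis, mergedKpis.columns[1], 0.995)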
Can anyone tell me how to solve this issue?