
I am using a Spark 1.6.1 standalone cluster with 6 workers (8 cores and 5 GB executor memory per node).

My DataFrame contains 13 columns. I want to take the 99.5th percentile of each column, so I used the percentile_approx Hive UDAF as suggested in https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62wQV68D6J87EVq6AD5-T3D0F3fHjuzs+1C5aCHOUUQS8w@mail.gmail.com%3E . I am trying to collect the percentile values of the 13 columns into a dictionary. The collect operation shows only 1 task, and that task stayed idle for so long that I killed the job.

PYSPARK CODE:

# Build one SELECT with a percentile_approx(col, array(0.005, 0.995)) expression
# per column (skipping the first column), then collect the single result row.
query = ''
for col in mergedKpis.columns[1:]:
    query = query + "percentile_approx(" + col + ",array(0.005,0.995)) as " + col + ","
percentile_dict = sqlContext.sql("select " + query.strip(',') + " from input_table")\
                            .rdd.map(lambda j: j.asDict()).collect()

[Spark UI screenshot: the collect stage shows a single task]
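
As an aside, here is a minimal sketch of the same query built with ", ".join() instead of string concatenation plus trailing-comma stripping; it assumes, as above, that mergedKpis is registered as the temporary table input_table, and it collects the single result row into a dict:

# Minimal sketch (assumption: mergedKpis is registered as "input_table").
exprs = ["percentile_approx({0},array(0.005,0.995)) as {0}".format(c)
         for c in mergedKpis.columns[1:]]
row = sqlContext.sql("select " + ", ".join(exprs) + " from input_table").first()
percentile_dict = row.asDict()  # column name -> [0.5th, 99.5th percentile values]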

SCALA CODE:

// Build the same SELECT in Scala: one percentile_approx(col, array(0.005))
// expression per column, skipping the first column.
var query = ""
for (col <- mergedKpis.columns.tail) {
  query = query + ",percentile_approx(" + col + ",array(0.005))"
}
sqlContext.sql("select " + query.replaceFirst(",", "") + " from input_table").collect()

The Scala code shows the same behaviour in the Spark UI as the Python code.

I tried running the same query for one column on a 15 MB file and it took 6 seconds, and the time increases non-linearly with the file size.

I found a function to compute the percentile of an RDD at "compute percentile", but I cannot convert each column to an RDD and use computePercentile().
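
For illustration, here is a hedged sketch of how a single column could be pulled out as an RDD of doubles and an exact percentile computed by sorting, in the spirit of the computePercentile() approach linked above. exact_percentile is a hypothetical helper, not part of Spark, and it would have to run once per column:

def exact_percentile(df, col_name, p):
    # Pull the column out as an RDD of floats.
    values = df.select(col_name).rdd.map(lambda row: float(row[0]))
    n = values.count()
    # Sort, attach ranks, and look up the value at the requested rank.
    indexed = values.sortBy(lambda x: x).zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    rank = int(round(p * (n - 1)))
    return indexed.lookup(rank)[0]

# e.g. the 99.5th percentile of the second column:
# p995 = exact_percentile(mergedKpis, mergedKpis.columns[1], 0.995)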

Can anyone tell me how to solve this issue?

  • Show your code, please. But calculating percentiles is O(n log n) (i.e. more than linear), even if you use top (because the number of elements needed for the percentile is proportional to n), and with 13 columns you need to do that 13 times. You should be able to arrange for all thirteen to be done in parallel, maybe, but that's as good as it's going to get... – The Archetypal Paul May 24 '16 at 10:25
  • Please find the code added to the question. – Meethu Mathew May 24 '16 at 12:24
  • This isn't Scala. Removing the tag. – The Archetypal Paul May 24 '16 at 13:59
  • http://stackoverflow.com/q/31432843 – zero323 May 24 '16 at 17:06
  • I have added the Scala code as well. I want to know why percentile_approx is not using all the available cores. Is its implementation not distributed? Is there any other distributed implementation to find percentiles from a DataFrame? – Meethu Mathew May 25 '16 at 06:19

0 Answers