0

I am using spark-sql-2.4.1v with Java 8. I need to calculate percentiles such as 25,75,90 for some given data.

Given source dataset:

val df = Seq(
    (10, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
    (10, "1/15/2018", 0.006628853,  4, 0.16039063, 0.01378215, "west"),
    (10, "1/15/2018", 0.01378215,  20, 0.082049528, 0.010680705, "east"),
    (10, "1/15/2018", 0.810680705,  6, 0.819875458, 0.702228853, "west"),
    (10, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705, "east"),
    (11, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
    (11, "1/15/2018", 0.006628853,  4, 0.16039063, 0.01378215,  "west"),
    (11, "1/15/2018", 0.01378215,  20, 0.082049528, 0.010680705, "east"),
    (11, "1/15/2018", 0.810680705,  6, 0.819875458, 0.702228853, "west"),
    (11, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705,  "east"))  
  .toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "con_dist_3", "zone")


val percentiles = Seq(0.25, 0.75,0.90)  // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2")  // The columns to use


+---+---------+-----------+----------+-----------+-----------+----+
| id|     date|    revenue|con_dist_1| con_dist_2| con_dist_3|zone|
+---+---------+-----------+----------+-----------+-----------+----+
| 10|1/15/2018|0.010680705|        10|0.619875458|0.010680705|east|
| 10|1/15/2018|0.006628853|         4| 0.16039063| 0.01378215|west|
| 10|1/15/2018| 0.01378215|        20|0.082049528|0.010680705|east|
| 10|1/15/2018|0.810680705|         6|0.819875458|0.702228853|west|
| 10|1/15/2018|0.702228853|        30|0.916039063|0.810680705|east|
| 11|1/15/2018|0.010680705|        10|0.619875458|0.010680705|east|
| 11|1/15/2018|0.006628853|         4| 0.16039063| 0.01378215|west|
| 11|1/15/2018| 0.01378215|        20|0.082049528|0.010680705|east|
| 11|1/15/2018|0.810680705|         6|0.819875458|0.702228853|west|
| 11|1/15/2018|0.702228853|        30|0.916039063|0.810680705|east|
+---+---------+-----------+----------+-----------+-----------+----+

I need to calculate the given percentiles for each zone for the given columns. How can this be achieved?

expected results

+----+---------+-------------+-------------+------------+-------------+
|  id|     date|      revenue| perctile_col| quantile_0 |quantile_10  |
+----+---------+-------------+-------------+------------+-------------+
|  10|1/15/2018|  0.010680705| con_dist_1  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.010680705| con_dist_2  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.006628853| con_dist_1  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.006628853| con_dist_2  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|   0.01378215| con_dist_1  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|   0.01378215| con_dist_2  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.010680705| con_dist_1  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.010680705| con_dist_2  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.006628853| con_dist_1  |<quant0_val>|<quant10_val>|
|  10|1/15/2018|  0.006628853| con_dist_2  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.010680705| con_dist_1  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.010680705| con_dist_2  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.006628853| con_dist_1  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.006628853| con_dist_2  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|   0.01378215| con_dist_1  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|   0.01378215| con_dist_2  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.010680705| con_dist_1  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.010680705| con_dist_2  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.006628853| con_dist_1  |<quant0_val>|<quant10_val>|
|  11|1/15/2018|  0.006628853| con_dist_2  |<quant0_val>|<quant10_val>|
+----+---------+-------------+-------------+------------+-------------+

Note: quantile_0 and etc here are percentile 0 , 50, etc.

BdEngineer
  • 2,929
  • 4
  • 49
  • 85
  • check this post - https://stackoverflow.com/questions/62907356/in-spark-is-their-any-alternative-for-union-function-while-appending-new-row/63090531?noredirect=1#comment111611897_63090531 – Srinivas Jul 28 '20 at 05:51
  • what is ```quantile_0``` & ```quantile_10``` columns ?? – Srinivas Jul 28 '20 at 12:51
  • @Srinivas , quantile_0 , quantile_10 and etc here are percentile 0 , 10, etc. – BdEngineer Jul 28 '20 at 13:03
  • @Srinivas, thanks a lot , i have another use case like this , any advice please https://stackoverflow.com/questions/63137437/doing-multiple-column-value-look-up-after-joining-with-lookup-dataset – BdEngineer Jul 28 '20 at 15:17

0 Answers0