I'm using spark-sql 2.4.1 and I'm trying to find quantiles (percentile 0, percentile 25, etc.) on each column of my data. Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results?
My dataframe df:
+----+---------+-------------+----------+-----------+-------+
|  id|     date|      revenue|con_dist_1| con_dist_2|   zone|
+----+---------+-------------+----------+-----------+-------+
|  10|1/15/2018|  0.010680705|        10|0.019875458|   east|
|  10|1/15/2018|  0.006628853|         4|0.816039063|   west|
|  10|1/15/2018|   0.01378215|        20|0.082049528|   east|
|  10|1/15/2018|  0.010680705|         6|0.019875458|   west|
|  10|1/15/2018|  0.006628853|        30|0.816039063|   east|
+----+---------+-------------+----------+-----------+-------+
The final dataframe should look something like the one below, i.e. with the quantiles of each column computed per zone:
+---+---------+-----------+-------+-------------+-----------+-----------+
| id|     date|    revenue|   zone| perctile_col| quantile_0|quantile_10|
+---+---------+-----------+-------+-------------+-----------+-----------+
| 10|1/15/2018|0.010680705|   east|   con_dist_1|       10.0|       30.0|
| 10|1/15/2018|0.010680705|   east|   con_dist_2|0.019875458|0.816039063|
| 10|1/15/2018|0.010680705|   west|   con_dist_1|        4.0|        6.0|
| 10|1/15/2018|0.010680705|   west|   con_dist_2|0.019875458|0.816039063|
+---+---------+-----------+-------+-------------+-----------+-----------+
Is there any way to use partitionBy together with the approxQuantile function?
Will this be processed using repartition("zone"), i.e. without collecting the dataset for each zone?
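To make the goal concrete, here is a minimal sketch of the kind of per-zone computation I have in mind. It is not a confirmed solution: it assumes the SQL aggregate percentile_approx (called through expr) is an acceptable substitute for approxQuantile, that a plain groupBy("zone") can stand in for partitionBy, and that quantile_0 and quantile_10 mean the 0.0 and 0.1 percentiles. The local SparkSession and the names quantileCols, perColumn and result are only illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, lit}

// Local session for illustration only (assumption); in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("zone-quantiles").getOrCreate()
import spark.implicits._

// The sample rows from the question.
val df = Seq(
  (10, "1/15/2018", 0.010680705, 10, 0.019875458, "east"),
  (10, "1/15/2018", 0.006628853, 4, 0.816039063, "west"),
  (10, "1/15/2018", 0.01378215, 20, 0.082049528, "east"),
  (10, "1/15/2018", 0.010680705, 6, 0.019875458, "west"),
  (10, "1/15/2018", 0.006628853, 30, 0.816039063, "east")
).toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "zone")

// Columns whose quantiles are wanted (hypothetical helper name).
val quantileCols = Seq("con_dist_1", "con_dist_2")

// One aggregation per column: each percentile becomes its own named output column,
// so every calculated percentile can be retrieved by name afterwards.
val perColumn: Seq[DataFrame] = quantileCols.map { c =>
  df.groupBy("zone")
    .agg(
      expr(s"percentile_approx($c, 0.0)").cast("double").as("quantile_0"),
      expr(s"percentile_approx($c, 0.1)").cast("double").as("quantile_10")
    )
    .withColumn("perctile_col", lit(c))
}

// Stack the per-column results into the long layout (one row per zone and column).
// union matches columns by position, which is identical in every per-column frame.
val result = perColumn.reduce(_ union _)
result.show(false)
```

As far as I understand, percentile_approx is evaluated as a distributed aggregate, so nothing is collected to the driver per zone, whereas approxQuantile returns a local array of values. The sketch leaves out id, date and revenue because it is not clear to me which row's values should be carried into the result.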