Is there a clean way to compute moving percentiles on a Spark DataFrame?
I have a huge DataFrame that I'm aggregating into 15-minute windows, and I would like to compute percentiles on each portion.
import org.apache.spark.sql.functions._

df.groupBy(window(col("date").cast("timestamp"), "15 minutes"))
  .agg(sum("session"), mean("session"), percentile_approx("session", 0.5))
  .show()

which fails with:

error: not found: value percentile_approx
So I can compute basic aggregates like sum and average, but I also need the median and a few other percentiles.
Is there an efficient way to do this in Spark 2.1? As far as I can tell, neither median nor percentile_approx is exposed in the DataFrame API.
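One thing I'm wondering is whether going through the SQL expression form would be a reasonable workaround. Here is my untested guess, assuming percentile_approx is available as a SQL function in 2.1 (the expr call and the median_approx alias are my own):

import org.apache.spark.sql.functions._

// Untested guess: call the SQL function via expr() since there seems to be
// no Scala wrapper; "median_approx" is just my own column name.
df.groupBy(window(col("date").cast("timestamp"), "15 minutes"))
  .agg(sum("session"), mean("session"),
       expr("percentile_approx(session, 0.5)").as("median_approx"))
  .show()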
I know this question has been asked before, but the answers didn't converge on a single solution and were quite fuzzy to me. So I'd like to know whether, as of August 2017, there is a good and efficient solution.
Also, since I'm working with 15-minute windows, I'm wondering whether just computing exact percentiles would be feasible rather than relying on an approximation? Something like the sketch below is what I have in mind.
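By "hard computing" I mean a rough sketch like this: collect each window's values with collect_list and take the exact median in a UDF. The exactMedian UDF is my own, and this is only viable if a single window's values fit in executor memory:

import org.apache.spark.sql.functions._

// Rough sketch of an exact per-window median; exactMedian is my own UDF.
// Only reasonable if each 15-minute window's values fit in memory.
val exactMedian = udf { (xs: Seq[Double]) =>
  val sorted = xs.sorted
  val n = sorted.length
  if (n == 0) Double.NaN
  else if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

df.groupBy(window(col("date").cast("timestamp"), "15 minutes"))
  .agg(sum("session"), mean("session"),
       exactMedian(collect_list(col("session").cast("double"))).as("median_exact"))
  .show()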
Thanks a lot for your attention,
Have a good afternoon!
PS: Scala or PySpark, I don't mind; answers in both would be even better!