How to calculate rolling median in PySpark using Window()?

Question

How do I calculate rolling median of dollar for a window size of previous 3 values?

Input data

dollars timestampGMT       
25      2017-03-18 11:27:18
17      2017-03-18 11:27:19
13      2017-03-18 11:27:20
27      2017-03-18 11:27:21
13      2017-03-18 11:27:22
43      2017-03-18 11:27:23
12      2017-03-18 11:27:24

Expected Output data

dollars timestampGMT          rolling_median_dollar
25      2017-03-18 11:27:18   median(25)
17      2017-03-18 11:27:19   median(17,25)
13      2017-03-18 11:27:20   median(13,17,25)
27      2017-03-18 11:27:21   median(27,13,17)
13      2017-03-18 11:27:22   median(13,27,13)
43      2017-03-18 11:27:23   median(43,13,27)
12      2017-03-18 11:27:24   median(12,43,13)

Below code does moving avg but PySpark doesn't have F.median().

pyspark: rolling average using timeseries data

EDIT 1: The challenge is median() function doesn't exit. I cannot do

df = df.withColumn('rolling_average', F.median("dollars").over(w))

If I wanted moving average I could have done

df = df.withColumn('rolling_average', F.avg("dollars").over(w))

EDIT 2: Tried using approxQuantile()

windfun = Window().partitionBy().orderBy(F.col(date_column)).rowsBetwe‌en(-3, 0) sdf.withColumn("movingMedian", sdf.approxQuantile(col='a', probabilities=[0.5], relativeError=0.00001).over(windfun))

But getting error

AttributeError: 'list' object has no attribute 'over'

EDIT 3

Please give solution without Udf since it won't benefit from catalyst optimization.

Did you try to order by `timestampGMT` and do the calculation over the rows per window? Just curious what the problem is (and wonder if implementation of median might be the one). — Jacek Laskowski, Oct 16 '17 at 10:23
Seen `df.stat.approxQuantile` and https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html? — Jacek Laskowski, Oct 16 '17 at 11:03

score 12 · Answer 1 · edited Jan 14 '19 at 16:19

12

One way is to collect the $dollars column as a list per window, and then calculate the median of the resulting lists using an udf:

from pyspark.sql.window import Window
from pyspark.sql.functions import *
import numpy as np 
from pyspark.sql.types import FloatType

w = (Window.orderBy(col("timestampGMT").cast('long')).rangeBetween(-2, 0))
median_udf = udf(lambda x: float(np.median(x)), FloatType())

df.withColumn("list", collect_list("dollars").over(w)) \
  .withColumn("rolling_median", median_udf("list")).show(truncate = False)

+-------+---------------------+------------+--------------+
|dollars|timestampGMT         |list        |rolling_median|
+-------+---------------------+------------+--------------+
|25     |2017-03-18 11:27:18.0|[25]        |25.0          |
|17     |2017-03-18 11:27:19.0|[25, 17]    |21.0          |
|13     |2017-03-18 11:27:20.0|[25, 17, 13]|17.0          |
|27     |2017-03-18 11:27:21.0|[17, 13, 27]|17.0          |
|13     |2017-03-18 11:27:22.0|[13, 27, 13]|13.0          |
|43     |2017-03-18 11:27:23.0|[27, 13, 43]|27.0          |
|12     |2017-03-18 11:27:24.0|[13, 43, 12]|13.0          |
+-------+---------------------+------------+--------------+

edited Jan 14 '19 at 16:19

zero323

322,348
103
959
935

answered Oct 16 '17 at 12:55

mtoto

23,919
4
58
71

1

Thanks. But can we do it without Udf since it won't benefit from catalyst optimization? – GeorgeOfTheRF Oct 16 '17 at 13:01
there is no native Spark alternative I'm afraid. – mtoto Oct 16 '17 at 13:06
What about using percentRank() with window function? I read somewhere but code was not given. Does that ring a bell? – GeorgeOfTheRF Oct 16 '17 at 13:17
1

you are not partitioning your data, so percent_rank() would only give you the percentiles according to **all** timestamps. – mtoto Oct 16 '17 at 13:27
Will percentRank give median? I am defining range between so that till limit for previous 3 rows – GeorgeOfTheRF Oct 16 '17 at 13:36
according to the docs: *" returns the relative rank (i.e. percentile) of rows within a window partition.* So it won't work as you expect because you have no partitions. – mtoto Oct 16 '17 at 13:40
Wouldn't it be possible to add the 'list' column and then partitionBy('list')? – plalanne Oct 16 '17 at 13:46
you can try, but I'm fairly certain that it will be slower than the above – mtoto Oct 16 '17 at 13:49
By default entire data will be considered as a single partition. It's not Mandatory to specify partition – GeorgeOfTheRF Oct 16 '17 at 13:49
Is there a way to perform this same calculation with a groupBy? – John Stud Nov 16 '20 at 17:36

score 1 · Answer 2 · answered Feb 14 '23 at 16:01

Another way without using any udf is to use the expr from the pyspark.sql.functions

dict = [{'dollars': 25,'timestampGMT': '2017-03-18 11:27:18'},
        {'dollars': 17,'timestampGMT': '2017-03-18 11:27:19'},
        {'dollars': 13,'timestampGMT': '2017-03-18 11:27:20'},
        {'dollars': 27,'timestampGMT': '2017-03-18 11:27:21'},
        {'dollars': 13,'timestampGMT': '2017-03-18 11:27:22'},
        {'dollars': 43,'timestampGMT': '2017-03-18 11:27:23'},
        {'dollars': 12,'timestampGMT': '2017-03-18 11:27:24'}
       ]

test = spark.createDataFrame(dict,schema=['dollars','timestampGMT'])

test.withColumn("id", F.lit(1)).withColumn(
    "rolling_median_dollar",
    F.expr("percentile(dollars,0.5)").over(
        W.partitionBy("id")
        .orderBy(F.col("timestampGMT").cast("long"))
        .rowsBetween(-2, 0)
    ),
).drop('id').show()

+-------+-------------------+---------------------+
|dollars|       timestampGMT|rolling_median_dollar|
+-------+-------------------+---------------------+
|     25|2017-03-18 11:27:18|                 25.0|
|     17|2017-03-18 11:27:19|                 21.0|
|     13|2017-03-18 11:27:20|                 17.0|
|     27|2017-03-18 11:27:21|                 17.0|
|     13|2017-03-18 11:27:22|                 13.0|
|     43|2017-03-18 11:27:23|                 27.0|
|     12|2017-03-18 11:27:24|                 13.0|
+-------+-------------------+---------------------+

Thank you for the fresh new answer! – Ric S Mar 06 '23 at 11:19 — Ric S, Mar 06 '23 at 11:19

How to calculate rolling median in PySpark using Window()?

2 Answers2

Linked