2

i have use case where i need to do a calculation of a percentile on a column (let's call it X) over a sliding window. so the window definition is chronological - last 120 days:

days = lambda i: i * 86400
w = Window.partitionBy("entityId").orderBy(F.col("trn_time").cast("long").asc())
    .rangeBetween(-days(120),-days(1))

i thought on using approxQuantile but it is a Dataframe function . second option is using :

percent_rank().over(w)

but i need to sort the window by the numeric column (X) that i want to do the percentile on , and the window is already sorted by time. when i try to add X to the orderBY in the window definition :

w = Window.partitionBy("entityId").orderBy(F.col("trn_time").cast("long").asc(),"X")\
    .rangeBetween(-days(120),-days(1))

i get the following error : "A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions"

how can i implement this logic ?

user1450410
  • 191
  • 1
  • 13
  • you could try to sort df by time and X using window function with `row_number()` and then sort by that row number in your final calculation instead of time and X – chlebek Jun 30 '20 at 06:37

1 Answers1

4

You need to write it as a inbuilt SQL expression:

# This is like a UDF
magic_percentile = F.expr('percentile_approx(X, 0.5)')
# Define your window
w = Window.partitionBy("entityId").orderBy(F.col("trn_time").cast("long").asc())
    .rangeBetween(-days(120),-days(1))

df = df.withColumn("rolling_percentile", magic_percentile.over(w))

When calculating percentile, you always order the values from smallest to largest and then take the quantile value, so the values within your window will be sorted.

Reference: Median / quantiles within PySpark groupBy

pissall
  • 7,109
  • 2
  • 25
  • 45
  • thanks @pissall , but when is the percentile is ordering the values, this is the missing part, in the example you shared it just do the percentile_approx over X , without ordering X. the window is order by time. – user1450410 Jun 30 '20 at 10:21
  • @user1450410 The ordering will happen inside the window function, and you will not be able to see it. – pissall Jul 01 '20 at 03:40