I am trying to calculate the moving average of price over the last six months in PySpark.

Currently my table has a 6-month lagged date column (lagged_6month).

id  dates         lagged_6month  price
1   2017-06-02    2016-12-02     14.8
1   2017-08-09    2017-02-09     16.65
2   2017-08-16    2017-02-16     16
2   2018-05-14    2017-11-14     21.05
3   2017-09-01    2017-03-01     16.75

Desired Results

id  dates         avg6mprice
1   2017-06-02    20.6
1   2017-08-09    21.5
2   2017-08-16    16.25
2   2018-05-14    25.05
3   2017-09-01    17.75

Sample code

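from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql.window import Window  # Window is used below, so it must be imported
df = sqlContext.table("price_table")
# Problem line: rangeBetween is given columns here, which it does not seem to accept
w = Window.partitionBy([col('id')]).rangeBetween(col('dates'),col('lagged_6month'))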

rangeBetween does not seem to accept columns as arguments in the window specification.

gr.kr
  • Possible duplicate of [Spark Window Functions - rangeBetween dates](https://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates) – 10465355 Feb 26 '19 at 20:36
  • @user10465355 here the rangeBetween dates are dynamic, so we will not be able to hard-code a value. – gr.kr Feb 26 '19 at 22:06
  • @user10465355 and hence could this question please not be marked as a duplicate? – gr.kr Feb 26 '19 at 22:06
  • Could you give more detail about the calculation you want to perform? For example, given only the first DataFrame, how would you calculate the result `20.6` in the first row of the second DataFrame? – abeboparebop Feb 27 '19 at 20:30
  • @abeboparebop thanks for responding. I would like to calculate the result 20.6 as the 6-month rolling average over all calendar days. For example, the average of prices between 2016-12-02 and 2017-06-02. – gr.kr Feb 28 '19 at 00:31
  • If I understand you correctly, this means it is not possible to reproduce that result from the sample data in your question. I suggest you provide sample data that will allow a reproducible calculation. – abeboparebop Feb 28 '19 at 08:56
  • I believe you are also misunderstanding the Window function range arguments. It does not take *absolute* values for the range, i.e. exact dates. Instead, it takes relative values as arguments -- e.g. between "180 days ago" and "today." Follow the link in @user10465355's comment for an example. – abeboparebop Feb 28 '19 at 08:57

0 Answers