4

Normally rangeBetween works well if we know the number of days. For example, the following window function:

def days(i:Int):Long = i*86400
val w = Window.partitionBy(col("Store")).orderBy(col("mth").cast("timestamp").cast("long"))
   .rangeBetween(days(7), 0)

How can we expand this to months, as number of days in a month is not constant? Can I use any month function in rangeBetween to solve the problem?

Solution for this Problen

df.createOrReplaceTempView("df")

spark.sql(
    """SELECT *, mean(some_value) OVER (
        PARTITION BY id 
        ORDER BY CAST(start AS timestamp) 
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS mean FROM df""").show()
loneStar
  • 3,780
  • 23
  • 40
  • 1
    If you have Spark 2.3 or newer it's possible that the SQL API works, see: https://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates (the answer mentions days but you can try substituting with month). – Shaido Jun 04 '19 at 02:56
  • @Shaido In that answer he defined the lambda functions days in the above, how can define months functions, as there are no constant days in a month. In my code also i defined that function. I need something for months. – loneStar Jun 04 '19 at 03:15
  • Yes, so that part won't work. That's why I mentioned it requires a Spark version >= 2.3 which allows usage like "RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW" (here you can try substitute with month). – Shaido Jun 04 '19 at 03:26
  • 1
    Cool it worked out. – loneStar Jun 07 '19 at 16:03
  • Happy it worked out for you :) – Shaido Jun 07 '19 at 16:14
  • Cold you share how did you do it? I have precisely the same problem, but I cannot figure out how to implement this in Scala. – Lucas Lima Feb 06 '20 at 14:57
  • @LucasLima Solution to the problem is updated in the question itself – loneStar Feb 06 '20 at 15:07
  • Oh, that was the only way, then. I was expecting to be able to build a window object in Scala, not sneaking some SQL query in the code. This doesn't quite work for me - I'll have to try something else. Thank you. Also, in the question, you are using SQL to create a days window, but I don't see how does that translate to a months window - as you stated yourself, the amount of days vary among different months. – Lucas Lima Feb 06 '20 at 15:32
  • 2
    @LucasLima OVER (PARTITION BY otlet ORDER BY CAST(month AS timestamp) RANGE BETWEEN INTERVAL 59 MONTHS PRECEDING AND CURRENT ROW) – loneStar Feb 06 '20 at 19:09
  • Does somebody know how to implement this solely using PySpark and not introducing SQL code? – Tinkerbell Sep 14 '22 at 08:07

0 Answers0