I have a Spark Dataset of events, indexed by a timestamp. I would like to enrich each entry with additional information: the number of events occurring in the five minutes (300 seconds) following that event. So if the initial data consists of two columns, event_id and timestamp, I want to build a third column, counter, like below:
event_id   timestamp   counter
       0           0         4
       1         100         3
       2         150         2
       3         250         1
       4         275         0
       5         600         2
       6         610         1
       7         750         1
       8         950         2
       9        1100         1
      10        1200         0
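For reference, the sample data above can be reproduced with something like this (the SparkSession setup is my own scaffolding; only the column names event_id and timestamp come from the example):

import org.apache.spark.sql.SparkSession

// Build the example Dataset; assumes a local SparkSession named `spark`.
val spark = SparkSession.builder().master("local[*]").appName("events").getOrCreate()
import spark.implicits._

val myDataset = Seq(
  (0, 0L), (1, 100L), (2, 150L), (3, 250L), (4, 275L),
  (5, 600L), (6, 610L), (7, 750L), (8, 950L),
  (9, 1100L), (10, 1200L)
).toDF("event_id", "timestamp")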
I know that in Spark I can use a window function to count future events within a window of fixed size in terms of the number of rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rowsBetween defines the frame in ROWS: the current row plus the
// 300 rows that follow it, regardless of their timestamps.
val window = Window.orderBy('timestamp).rowsBetween(0, 300)
myDataset.withColumn("count_future_events", sum(lit(1)).over(window))
But this is not what I want: because the frame is defined in rows, the count is trivially the same for every row (301 here, except near the end of the dataset).
I wish something like this existed:
val window = Window.orderBy('timestamp).rowsBetween('timestamp, 'timestamp + 300) // 300 seconds here
But this does not compile, since rowsBetween only accepts plain Long row offsets, not Columns.
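Looking at the Window API, rangeBetween seems to be the value-based counterpart of rowsBetween: as far as I can tell, its offsets are interpreted in the units of the ORDER BY expression rather than as row counts. A sketch of what I would try, assuming timestamp is a numeric column holding seconds:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rangeBetween defines the frame on the VALUES of the ORDER BY column:
// all rows whose timestamp lies in [current, current + 300] seconds.
// The current row falls inside its own frame, hence the "- 1".
val byTimeRange = Window.orderBy(col("timestamp")).rangeBetween(0, 300)
val enriched = myDataset.withColumn("counter", count(lit(1)).over(byTimeRange) - 1)

On the sample data this would give counter = 4 for event 0 (events 1 through 4 fall in [0, 300]) and 0 for event 10, matching the table above.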
Is something like that the right approach, or is there another way to achieve what I want?