A sample dataframe can be created using:
from pyspark.sql.functions import col
from pyspark.sql.window import Window
df = (sc.parallelize([['2019-08-29 01:00:00', 0],
                      ['2019-08-29 02:00:00', 0],
                      ['2019-08-29 03:00:00', 0],
                      ['2019-08-29 04:00:00', 1],
                      ['2019-08-29 05:00:00', 2],
                      ['2019-08-29 06:00:00', 3],
                      ['2019-08-29 07:00:00', 0],
                      ['2019-08-29 08:00:00', 2],
                      ['2019-08-29 09:00:00', 0],
                      ['2019-08-29 10:00:00', 1]])
      .toDF(['DATETIME', 'VAL'])
      .withColumn('DATETIME', col('DATETIME').cast('timestamp')))
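Equivalently, if a SparkSession is in scope (assumed here to be named spark; the snippet above assumes a SparkContext sc), the same frame can be built without going through an RDD:

from pyspark.sql.functions import col

# same 10 hourly rows, built directly as a DataFrame
vals = [0, 0, 0, 1, 2, 3, 0, 2, 0, 1]
rows = [('2019-08-29 %02d:00:00' % h, v) for h, v in enumerate(vals, start=1)]
df = (spark.createDataFrame(rows, ['DATETIME', 'VAL'])
      .withColumn('DATETIME', col('DATETIME').cast('timestamp')))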
I want to produce a column counting the occurrences of VAL = 0 within a three-hour period: ±1 hour of the current row's timestamp, inclusive of the current row's VAL (so three hourly rows in total). The window can be created using:
# order by epoch seconds; the ±3600 s range gives the ±1 hour window
w1 = (Window
      .orderBy(col('DATETIME').cast('long'))
      .rangeBetween(-(60 * 60), 60 * 60))
Desired outcome:
+-------------------+---+---+
| DATETIME|VAL|NUM|
+-------------------+---+---+
|2019-08-29 01:00:00| 0| 2|
|2019-08-29 02:00:00| 0| 3|
|2019-08-29 03:00:00| 0| 2|
|2019-08-29 04:00:00| 1| 1|
|2019-08-29 05:00:00| 2| 0|
|2019-08-29 06:00:00| 3| 1|
|2019-08-29 07:00:00| 0| 1|
|2019-08-29 08:00:00| 2| 2|
|2019-08-29 09:00:00| 0| 1|
|2019-08-29 10:00:00| 1| 1|
+-------------------+---+---+
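For illustration, a conditional sum over w1 would be one way to produce the NUM column above (a sketch, not confirmed as the intended approach; F.sum and F.when are standard pyspark.sql.functions):

from pyspark.sql import functions as F

# flag rows where VAL is 0, then sum the flags over the ±1 hour window
df = df.withColumn('NUM',
                   F.sum(F.when(F.col('VAL') == 0, 1).otherwise(0)).over(w1))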