Let's imagine we have the following dataframe :
port | flag | timestamp
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00
30 | R | 2009-04-24T17:14:14+00:00
32 | S | 2009-04-24T17:15:14+00:00
21 | R | 2009-04-24T17:16:14+00:00
54 | R | 2009-04-24T17:17:14+00:00
24 | R | 2009-04-24T17:18:14+00:00
I would like to calculate the number of distinct port, flag
over the 3 hours in Pyspark.
The result will be something like :
port | flag | timestamp | distinct_port_flag_overs_3h
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00 | 1
30 | R | 2009-04-24T17:14:14+00:00 | 1
32 | S | 2009-04-24T17:15:14+00:00 | 2
21 | R | 2009-04-24T17:16:14+00:00 | 2
54 | R | 2009-04-24T17:17:14+00:00 | 2
24 | R | 2009-04-24T17:18:14+00:00 | 3
The SQL request looks like :
SELECT
COUNT(DISTINCT port) OVER my_window AS distinct_port_flag_overs_3h
FROM my_table
WINDOW my_window AS (
PARTITION BY flag
ORDER BY CAST(timestamp AS timestamp)
RANGE BETWEEN INTERVAL 3 HOUR PRECEDING AND CURRENT
)
I found this topic that solves the problem but only if we want to count distinct elements over one field.
Do someone has any idea of how to achieve that in :
python 3.7
pyspark 2.4.4