Let's say I have the following Spark DataFrame:
+-------------------+--------+
|timestamp |UserName|
+-------------------+--------+
|2021-08-11 04:05:06|A |
|2021-08-11 04:15:06|B |
|2021-08-11 09:15:26|A |
|2021-08-11 11:04:06|B |
|2021-08-11 14:55:16|A |
|2021-08-13 04:12:11|B |
+-------------------+--------+
I want to build time-series data at a desired time resolution, based on event counts for each user.
- Note1: obviously, after grouping by UserName and counting per the desired time frame/resolution, the time frames need to be kept in the Spark frame (maybe using event-time aggregation and watermarking from Apache Spark's Structured Streaming).
- Note2: missing time frames need to be filled in, with 0 as the count when there are no events.
- Note3: I'm not interested in using a UDF or hacking it via toPandas().
So, for a 24-hour (daily) time frame, the expected result after groupBy should look like this:
+------------------------------------------+-------------+-------------+
|window_frame_24_Hours | username A | username B |
+------------------------------------------+-------------+-------------+
|{2021-08-11 00:00:00, 2021-08-11 23:59:59}|3 |2 |
|{2021-08-12 00:00:00, 2021-08-12 23:59:59}|0 |0 |
|{2021-08-13 00:00:00, 2021-08-13 23:59:59}|0 |1 |
+------------------------------------------+-------------+-------------+
Edit1: in the case of a 12-hour time frame/resolution:
+------------------------------------------+-------------+-------------+
|window_frame_12_Hours | username A | username B |
+------------------------------------------+-------------+-------------+
|{2021-08-11 00:00:00, 2021-08-11 11:59:59}|2 |2 |
|{2021-08-11 12:00:00, 2021-08-11 23:59:59}|1 |0 |
|{2021-08-12 00:00:00, 2021-08-12 11:59:59}|0 |0 |
|{2021-08-12 12:00:00, 2021-08-12 23:59:59}|0 |0 |
|{2021-08-13 00:00:00, 2021-08-13 11:59:59}|0 |1 |
|{2021-08-13 12:00:00, 2021-08-13 23:59:59}|0 |0 |
+------------------------------------------+-------------+-------------+