I am streaming a lot of data (200k+ events per 3-second batch) from Kafka using the PySpark KafkaUtils implementation (roughly the setup sketched below).
I receive live data with:

- a `sessionID`
- an `ip`
- a `state`
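For context, the ingestion side looks roughly like this (the topic name, broker address, and the JSON layout of the messages are placeholders for my real setup):

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="session-states")
ssc = StreamingContext(sc, 3)  # 3-second batches

# Placeholder topic and broker for the direct (receiver-less) Kafka stream
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "kafka:9092"})

def map_data(record):
    # Each record is a (key, value) pair; the value is assumed to be a JSON
    # object carrying the three fields above.
    event = json.loads(record[1])
    return event["sessionID"], (event["state"], event["ip"])
```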
What I am doing for now with a basic Spark/Redis implementation is the following:
Spark job:

- aggregate the data by `sessionID`:
  `rdd_combined = rdd.map(map_data).combineByKey(lambda x: frozenset([x]), lambda s, v: s | frozenset([v]), lambda s1, s2: s1 | s2)`
- build a `set` of the different `state` values (they can be 1, 2, 3...)
- keep the `ip` information so it can later be turned into a `lon`/`lat`
- check whether the `sessionID` is already in Redis; if yes, update it, otherwise write it to Redis (roughly as in the sketch below).
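The Redis write step looks roughly like this (the `session:<id>` / `<key>:ts` key layout and the per-partition connection are just my own conventions, and the ip-to-lon/lat lookup is left out):

```python
import time

import redis

def write_partition(partition):
    # One Redis connection per partition of the combined RDD.
    r = redis.StrictRedis(host="localhost", port=6379)
    for session_id, values in partition:
        key = "session:%s" % session_id
        new_states = set(state for state, _ip in values)
        is_new = not r.exists(key)
        if new_states:
            r.sadd(key, *new_states)         # merge this batch's states into the stored set
        if is_new:
            r.set(key + ":ts", time.time())  # remember when the session was first seen

# rdd_combined is the result of the combineByKey step above
rdd_combined.foreachPartition(write_partition)
```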
Then I run a small Python script, working on Redis only, that checks whether there is a 1 among the states (sketched after the list):

- if yes, the event is published to a channel (say `channel_1`) and deleted from Redis;
- if not, we check/update a timestamp; if `NOW() - timestamp > 10 min`, the data is published to `channel_2`, otherwise we do nothing.
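Simplified (it publishes the session key rather than the full event, and assumes the key layout written by the job above):

```python
import time

import redis

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

for key in r.scan_iter("session:*"):
    if key.endswith(":ts"):
        continue
    states = r.smembers(key)
    if "1" in states:
        # a state 1 was seen for this session: notify and forget it
        r.publish("channel_1", key)
        r.delete(key, key + ":ts")
    else:
        first_seen = float(r.get(key + ":ts") or time.time())
        if time.time() - first_seen > 600:  # more than 10 minutes without a 1
            r.publish("channel_2", key)
            r.delete(key, key + ":ts")
```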
Question:

I keep wondering what the best implementation would be to do most of this work in Spark itself.

- using `window` plus an aggregation, or `reduceByKeyAndWindow`: my fear is that a 10-minute window recomputed every 3 seconds over almost the same data would not be very efficient;
- using `updateStateByKey` seems interesting, but the data is never deleted by default and this could become problematic. Also, how could I check that we are past the 10 minutes? (See the sketch after this list for what I have in mind.)
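For the `updateStateByKey` option, what I have in mind is to keep a first-seen timestamp inside the state to answer the 10-minute question, and to return `None` once a key is finished with, which (if I read the PySpark behaviour correctly) drops that key from the state. I am not sure this is correct or that it scales, hence the question; `stream` and `map_data` are the ones from the snippets above.

```python
import time

def update_session(new_values, current):
    # new_values: the (state, ip) pairs seen for this sessionID in this batch.
    # current: (set_of_states, first_seen_timestamp), or None for a new key.
    states, first_seen = current if current is not None else (set(), time.time())
    states = states | set(s for s, _ip in new_values)

    if 1 in states:
        # would publish to channel_1 (probably from a foreachRDD, not from here)
        return None                        # returning None should drop the key from the state
    if time.time() - first_seen > 600:     # past the 10 minutes without a 1
        # would publish to channel_2
        return None
    return (states, first_seen)

ssc.checkpoint("/tmp/checkpoints")         # updateStateByKey requires checkpointing
sessions = stream.map(map_data).updateStateByKey(update_session)
```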
Any thoughts on this implementation, or on other approaches that could work?