I am streaming a lot of data (200k+ events per 3-second batch) from Kafka using the PySpark KafkaUtils implementation (roughly the setup sketched below).
I receive live data with:

- a `sessionID`
- an `ip`
- a `state`
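For context, the ingestion side looks roughly like this (the topic name, broker address, and the JSON layout of the messages are placeholders for my real setup):

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="session-states")
ssc = StreamingContext(sc, 3)  # 3-second batches

# Placeholder topic and broker for the direct (receiver-less) Kafka stream
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "kafka:9092"})

def map_data(record):
    # Each record is a (key, value) pair; the value is assumed to be a JSON
    # object carrying the three fields above.
    event = json.loads(record[1])
    return event["sessionID"], (event["state"], event["ip"])
```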
What I am doing for now with a basic Spark/Redis implementation is the following:
Spark job:

- aggregate the data by `sessionID`:
  `rdd_combined = rdd.map(map_data).combineByKey(lambda x: frozenset([x]), lambda s, v: s | frozenset([v]), lambda s1, s2: s1 | s2)`
- build a `set` of the different `state` values (they can be 1, 2, 3...)
- keep the `ip` information so it can later be turned into a `lon`/`lat`
- check whether the `sessionID` is already in Redis; if yes, update it, otherwise write it to Redis (roughly as in the sketch below).
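The Redis write step looks roughly like this (the `session:<id>` / `<key>:ts` key layout and the per-partition connection are just my own conventions, and the ip-to-lon/lat lookup is left out):

```python
import time

import redis

def write_partition(partition):
    # One Redis connection per partition of the combined RDD.
    r = redis.StrictRedis(host="localhost", port=6379)
    for session_id, values in partition:
        key = "session:%s" % session_id
        new_states = set(state for state, _ip in values)
        is_new = not r.exists(key)
        if new_states:
            r.sadd(key, *new_states)         # merge this batch's states into the stored set
        if is_new:
            r.set(key + ":ts", time.time())  # remember when the session was first seen

# rdd_combined is the result of the combineByKey step above
rdd_combined.foreachPartition(write_partition)
```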
Then I run a small Python script, working on Redis only, that checks whether there is a 1 among the states (sketched after the list):

- if yes, the event is published to a channel (say `channel_1`) and deleted from Redis;
- if not, we check/update a timestamp; if `NOW() - timestamp > 10 min`, the data is published to `channel_2`, otherwise we do nothing.
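Simplified (it publishes the session key rather than the full event, and assumes the key layout written by the job above):

```python
import time

import redis

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

for key in r.scan_iter("session:*"):
    if key.endswith(":ts"):
        continue
    states = r.smembers(key)
    if "1" in states:
        # a state 1 was seen for this session: notify and forget it
        r.publish("channel_1", key)
        r.delete(key, key + ":ts")
    else:
        first_seen = float(r.get(key + ":ts") or time.time())
        if time.time() - first_seen > 600:  # more than 10 minutes without a 1
            r.publish("channel_2", key)
            r.delete(key, key + ":ts")
```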
Question:

I keep wondering what the best implementation would be to do most of this work in Spark itself.

- using `window` plus an aggregation, or `reduceByKeyAndWindow`: my fear is that a 10-minute window recomputed every 3 seconds over almost the same data would not be very efficient;
- using `updateStateByKey` seems interesting, but the data is never deleted by default and this could become problematic. Also, how could I check that we are past the 10 minutes? (See the sketch after this list for what I have in mind.)
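For the `updateStateByKey` option, what I have in mind is to keep a first-seen timestamp inside the state to answer the 10-minute question, and to return `None` once a key is finished with, which (if I read the PySpark behaviour correctly) drops that key from the state. I am not sure this is correct or that it scales, hence the question; `stream` and `map_data` are the ones from the snippets above.

```python
import time

def update_session(new_values, current):
    # new_values: the (state, ip) pairs seen for this sessionID in this batch.
    # current: (set_of_states, first_seen_timestamp), or None for a new key.
    states, first_seen = current if current is not None else (set(), time.time())
    states = states | set(s for s, _ip in new_values)

    if 1 in states:
        # would publish to channel_1 (probably from a foreachRDD, not from here)
        return None                        # returning None should drop the key from the state
    if time.time() - first_seen > 600:     # past the 10 minutes without a 1
        # would publish to channel_2
        return None
    return (states, first_seen)

ssc.checkpoint("/tmp/checkpoints")         # updateStateByKey requires checkpointing
sessions = stream.map(map_data).updateStateByKey(update_session)
```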
Any thoughts on this implementation, or on other approaches that could work?