I am streaming a lot of data (200k+ events per 3-second batch) from Kafka using the KafkaUtils PySpark implementation.

I receive live data with:

  • a sessionID
  • an IP
  • a state
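
For instance, a single event looks roughly like this (the values are made up for illustration):

    {"sessionID": "abc123", "ip": "203.0.113.7", "state": 2}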

What I am doing for now with a basic Spark/Redis implementation is the following:

Spark job:

  • aggregate the data by sessionID : rdd_combined = rdd.map(map_data).combineByKey(lambda x: frozenset([x]), lambda s, v: s | frozenset([v]), lambda s1, s2: s1 | s2)
  • build the set of distinct states (which can be 1, 2, 3, ...)
  • keep the IP so it can later be transformed into a lon/lat pair
  • check whether the sessionID is already in Redis; if so, update it, otherwise write it to Redis (a condensed sketch of this job follows the list).
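
To make the above concrete, here is a condensed sketch of the current job. It assumes redis-py, that map_data emits (sessionID, (state, ip)) pairs, and that stream is the DStream returned by KafkaUtils.createDirectStream; the key layout and helper names are illustrative, not the real code.

    import json
    import time
    import redis

    def map_data(msg):
        # placeholder parser: Kafka (key, value) -> (sessionID, (state, ip))
        d = json.loads(msg[1])
        return d["sessionID"], (d["state"], d["ip"])

    def write_partition(partition):
        r = redis.StrictRedis(host="localhost", port=6379)
        for session_id, pairs in partition:
            states = {state for state, _ in pairs}   # distinct states seen in this batch
            ip = next(ip for _, ip in pairs)         # kept for the later lon/lat lookup
            key = "session:{}".format(session_id)
            r.sadd(key + ":states", *states)         # creates the set or updates it
            r.hsetnx(key, "ip", ip)                  # only written the first time
            r.hsetnx(key, "ts", int(time.time()))    # first-seen timestamp

    def process(rdd):
        rdd_combined = rdd.map(map_data).combineByKey(
            lambda x: frozenset([x]),
            lambda s, v: s | frozenset([v]),
            lambda s1, s2: s1 | s2)
        rdd_combined.foreachPartition(write_partition)

    stream.foreachRDD(process)  # stream = KafkaUtils.createDirectStream(...)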

Then I run a small Python script, only against Redis, that checks whether there is a 1 in the state set:

  • if so, the event is published to a channel (say channel_1) and deleted from Redis.
  • if not, we check/update a timestamp; if NOW() - timestamp > 10 minutes, the data is published to channel_2, otherwise we do nothing (see the sketch after this list).
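
For reference, the checker script is roughly the following (again assuming redis-py and the key layout sketched above; channel names match the description):

    import time
    import redis

    r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

    for key in r.scan_iter("session:*:states"):
        session_key = key[:-len(":states")]
        if "1" in r.smembers(key):                 # a state 1 was seen for this session
            r.publish("channel_1", session_key)    # notify channel_1 and drop the session
            r.delete(key, session_key)
        else:
            ts = float(r.hget(session_key, "ts") or time.time())
            if time.time() - ts > 600:             # NOW() - timestamp > 10 minutes
                r.publish("channel_2", session_key)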

Question:

I keep wondering what the best implementation would be to do most of this work in Spark.

  • using window plus an aggregation, or reduceByKeyAndWindow: my fear is that with a 10-minute window, the computation runs every 3 seconds over almost the same data, which does not seem very efficient.
  • using updateStateByKey seems interesting, but the data is never deleted, which could become problematic. Also, how could I check that we are past the 10 minutes? (A rough sketch of what I have in mind is below.)
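
For what it is worth, this is the minimal sketch I have in mind for updateStateByKey, keeping the state set plus a first-seen timestamp per key (assuming the same (sessionID, (state, ip)) pairs as above, and that checkpointing is enabled on the StreamingContext); my doubt is whether returning None is the right way to drop a key and whether checking the 10 minutes inside the update function is sound.

    import time

    def update_session(new_values, current):
        # current is None the first time a sessionID is seen
        states, first_seen = current or (frozenset(), time.time())
        for state, _ip in new_values:              # new_values are (state, ip) pairs
            states = states | frozenset([state])
        if 1 in states:                            # state 1 arrived: handled elsewhere
            return None                            # returning None drops the key
        if time.time() - first_seen > 600:         # past the 10 minutes
            return None
        return (states, first_seen)

    # requires ssc.checkpoint(...) to be set
    sessions = stream.map(map_data).updateStateByKey(update_session)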

Any thoughts on this implementation, or on others that could be possible?
