
Hi, I'm planning a deployment where Spark would do the heavy lifting of processing incoming data from Kafka, applying StreamingKMeans for outlier detection.

However, the data arriving on the Kafka topic comes from various sources with different data structures, which require different KMeans models (states). So potentially every entry in the incoming discretized stream (DStream) should pass through its own KMeans model, selected by a "key" field (basically, I need single-event processing).

Can this type of processing be achieved with Spark? If yes, does it exploit Spark parallelism in the end? I'm quite a newbie in Spark and Scala and feel like I'm missing something.

Thanks in advance.

UPDATE:

I'm currently looking into the mapWithState operator, which seems to solve this problem. The question is: can I directly save the StreamingKMeans model into the state? Otherwise I would have to save the centroids and instantiate a new model in the state update function, which seems expensive.

Peterdeka

1 Answer


Can this type of processing be achieved with Spark? If yes, does it exploit Spark parallelism in the end?

Theoretically this type of processing is possible and it can benefit from distributed processing, but definitely not with the tools you want to use.

StreamingKMeans is a model designed to work on RDDs, and since Spark doesn't support nested transformations, you cannot use it inside stateful transformations.
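For reference, the model's only training and prediction hooks operate on whole DStreams and are registered on the driver before the streaming context starts. A minimal sketch (k, decay factor, and dimensionality are made-up values for illustration):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainOn/predictOn are DStream-level operations; there is no
// per-record entry point you could call from inside mapWithState
// or updateStateByKey.
def attachModel(points: DStream[Vector]): DStream[Int] = {
  val model = new StreamingKMeans()
    .setK(3)                  // number of clusters (illustrative)
    .setDecayFactor(1.0)      // weight all past data equally
    .setRandomCenters(2, 0.0) // dim = 2, initial center weight = 0.0
  model.trainOn(points)       // updates the centers batch by batch
  model.predictOn(points)     // emits a cluster id per point
}
```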

If the set of keys has low cardinality and all values are known up front, you can split the RDDs by key and keep a separate model per key.
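For example, with a handful of keys known before the context starts, something along these lines should work (a sketch; the key type and model parameters are placeholders):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// One filtered sub-stream and one StreamingKMeans per key, all wired
// up on the driver before ssc.start().
def perKeyModels(
    keyed: DStream[(String, Vector)],
    keys: Seq[String]): Map[String, DStream[Int]] =
  keys.map { key =>
    val points = keyed.filter { case (k, _) => k == key }.map(_._2)
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)
    model.trainOn(points)
    key -> model.predictOn(points)
  }.toMap
```

Each sub-stream is still processed in parallel across the cluster, but the set of models is fixed at startup, which is why this only works for a low-cardinality, known key set.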

If not, you can replace StreamingKMeans with a third-party local and serializable k-means model and use it in combination with mapWithState or updateStateByKey. In general this should be much more efficient than using the distributed version, without reducing overall parallelism.
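A sketch of that approach, assuming a hypothetical LocalKMeans trait standing in for whichever local library you choose (the trait and its update/predict methods are placeholders, not an existing API):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object PerKeyClustering {
  // Placeholder for any serializable, single-machine online k-means.
  trait LocalKMeans extends Serializable {
    def update(point: Vector): LocalKMeans // fold one point into the model
    def predict(point: Vector): Int        // nearest-centroid assignment
  }

  def emptyModel: LocalKMeans = ??? // plug in a concrete implementation

  // One small model per key, updated one record at a time.
  val spec = StateSpec.function(
    (key: String, point: Option[Vector], state: State[LocalKMeans]) => {
      val model = state.getOption.getOrElse(emptyModel)
      val updated = point.map(model.update).getOrElse(model)
      state.update(updated)
      point.map(updated.predict) // cluster id for this record
    })

  // keyed: DStream[(String, Vector)] parsed from the Kafka stream
  def cluster(keyed: DStream[(String, Vector)]): DStream[Option[Int]] =
    keyed.mapWithState(spec)
}
```

Since the whole model lives in the per-key state, it has to be small and serializable (the state is checkpointed), but the keyspace is still partitioned across executors, so overall parallelism is preserved.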

zero323
  • Thanks zero323. For generality: the keys are unpredictable at this application layer, as they depend on the source the data comes from, and Kafka sources are added at runtime by another application layer. Is your advice to go with the third-party option (any recommendations?) or to move to Flink? Flink seems more suitable for this case, in my opinion... – Peterdeka Jul 22 '16 at 10:39
  • Oh, maybe I misunderstood: with the third-party thing you meant not using Spark and friends, right?! That was my first option... :D – Peterdeka Jul 22 '16 at 10:49
  • I am biased here. I tried Flink and I didn't like the API design. Ignoring that, AFAIK it doesn't provide any methods that address this particular scenario, though I could be wrong. Regarding local libs: ELKI is decent in general, although the docs are far from great. – zero323 Jul 22 '16 at 10:55
  • In Flink I could probably use a CoFlatMapFunction, which is also useful if I want to edit the model in some way through command messages over a command stream. Regarding the custom solution, I use Go, so I've got many choices; the doubt remains when it comes to scaling to multiple instances, which will require persisting Kafka offsets and, most of all, sharing and updating the KMeans model state among instances. – Peterdeka Jul 22 '16 at 11:03
  • However, I still don't get why inside the mapWithState update function I cannot apply the k-means to the single RDD object (it's a map, so I should get the single object and its state by key): https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html – Peterdeka Jul 22 '16 at 11:11
  • This will be a subjective opinion, but as much as I like working with Spark, it is not a great tool when you need fine-grained control. In theory its architecture is generic enough to be adjusted to an arbitrary scenario, but in practice it is still a batch-first solution. Regarding mapWithState: a) trainOn takes a DStream and requires a transformation; b) mapWithState is a transformation (more or less) and takes a single element at a time. – zero323 Jul 22 '16 at 11:21