Hi, I'm planning a deployment where Spark does the heavy lifting of processing incoming data from Kafka, applying StreamingKMeans for outlier detection.
However, the data on the Kafka topic comes from various sources with different data structures, each requiring its own KMeans model (state). So potentially every entry in the incoming discretized stream (DStream) should pass through its own KMeans model, selected by a "key" field (basically I need single-event processing).
Can this type of processing be achieved with Spark? If so, does it still exploit Spark's parallelism? I'm quite new to Spark and Scala and feel like I'm missing something.
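To make it concrete, here is a rough sketch of the shape I have in mind; the `Event` case class and the parsing step are just placeholders for my actual records:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Placeholder record type: "key" decides which KMeans model the event belongs to
case class Event(key: String, features: Vector)

// events: DStream[Event] already parsed from the Kafka topic (Kafka setup omitted)
def keyByModel(events: DStream[Event]): DStream[(String, Vector)] =
  events.map(e => (e.key, e.features)) // pair DStream keyed by model id
```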
Thanks in advance.
UPDATE:
I'm currently looking into the mapWithState
operator, which seems to address this problem. The question is: can I save the StreamingKMeans model directly into the state? Otherwise I would have to keep only the centroids in the state and instantiate a new model in the state-update function, which seems expensive.
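For reference, this is roughly what I'm experimenting with in the state-update function: keeping only the centroids and their weights in the state and doing a single-point update by hand, since (as far as I can tell) StreamingKMeansModel.update expects a whole RDD rather than an individual point. All the names here (KMeansState, scoreAndUpdate, initialStateFor, the k/dim defaults) are my own placeholders, not anything from the Spark API:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.streaming.{State, StateSpec}

// Per-key state: the centroids and weights that StreamingKMeans tracks internally,
// kept directly instead of the full model object
case class KMeansState(centers: Array[Vector], weights: Array[Double])

// Hypothetical seeding of a new key's state; real initialization would differ
def initialStateFor(key: String, k: Int = 3, dim: Int = 2): KMeansState =
  KMeansState(
    Array.fill(k)(Vectors.dense(Array.fill(dim)(scala.util.Random.nextDouble()))),
    Array.fill(k)(0.0))

// Mapping function for mapWithState: called once per event, keyed by model id.
// Emits the squared distance to the closest centroid as an outlier score.
def scoreAndUpdate(key: String,
                   value: Option[Vector],
                   state: State[KMeansState]): Option[(String, Double)] =
  value.map { point =>
    val s = state.getOption().getOrElse(initialStateFor(key))

    // Find the closest centroid for this point
    val (bestIdx, bestDist) = s.centers.indices
      .map(i => (i, Vectors.sqdist(s.centers(i), point)))
      .minBy(_._2)

    // Single-point, sequential-k-means-style update of the winning centroid;
    // this only roughly mimics what StreamingKMeansModel.update does per batch
    val w    = s.weights(bestIdx) + 1.0
    val oldC = s.centers(bestIdx).toArray
    val p    = point.toArray
    val newC = oldC.indices.map(i => oldC(i) + (p(i) - oldC(i)) / w).toArray

    s.centers(bestIdx) = Vectors.dense(newC)
    s.weights(bestIdx) = w
    state.update(s)

    (key, bestDist) // (model id, outlier score)
  }

// Wiring it up with the keyed DStream from the first sketch:
// val scored = keyByModel(events).mapWithState(StateSpec.function(scoreAndUpdate _))
```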