I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive are related to a user session, and therefore, for the batch processing I'm only interested in processing the datapoints once the session has finished. So, since I'm using Kafka, the only way I see to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished.
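To make the idea concrete, here is a minimal sketch of that filtering step in plain Python (independent of Spark/Kafka). It assumes, hypothetically, that each datapoint carries a `session_id` and that a session emits a final datapoint with `event == "session_end"` when it closes; the field names are my own invention for illustration:

```python
# Hypothetical datapoint shape: {"session_id": ..., "event": ...}
# A session is considered finished once a "session_end" event is seen.

def finished_sessions(datapoints):
    """Return the set of session ids that have a session_end marker."""
    return {dp["session_id"] for dp in datapoints
            if dp.get("event") == "session_end"}

def filter_complete(datapoints):
    """Keep only datapoints belonging to finished sessions."""
    done = finished_sessions(datapoints)
    return [dp for dp in datapoints if dp["session_id"] in done]

points = [
    {"session_id": "a", "event": "click"},
    {"session_id": "a", "event": "session_end"},
    {"session_id": "b", "event": "click"},  # session "b" still open, dropped
]
print(filter_complete(points))  # only session "a" datapoints remain
```

In a Spark batch job the same logic would be a join (or `groupBy` on `session_id`) between all datapoints and the set of finished session ids, rather than an in-memory list comprehension.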
So, what I'd like to ask is:
- Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka with Apache Spark for batch processing / MapReduce.)
- Is there a better approach to solve the user sessions problem?
Thanks.