I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive are related to a user session, and therefore, for the batch processing I'm only interested in processing the datapoints once the session has finished. So, since I'm using Kafka, the only way I see to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished.
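To make the idea concrete, here is a minimal sketch of that filtering step in plain Python (independent of Spark/Kafka). It assumes, hypothetically, that each datapoint carries a `session_id` and that a session emits a final datapoint with `event == "session_end"` when it closes; the field names are my own invention for illustration:

```python
# Hypothetical datapoint shape: {"session_id": ..., "event": ...}
# A session is considered finished once a "session_end" event is seen.

def finished_sessions(datapoints):
    """Return the set of session ids that have a session_end marker."""
    return {dp["session_id"] for dp in datapoints
            if dp.get("event") == "session_end"}

def filter_complete(datapoints):
    """Keep only datapoints belonging to finished sessions."""
    done = finished_sessions(datapoints)
    return [dp for dp in datapoints if dp["session_id"] in done]

points = [
    {"session_id": "a", "event": "click"},
    {"session_id": "a", "event": "session_end"},
    {"session_id": "b", "event": "click"},  # session "b" still open, dropped
]
print(filter_complete(points))  # only session "a" datapoints remain
```

In a Spark batch job the same logic would be a join (or `groupBy` on `session_id`) between all datapoints and the set of finished session ids, rather than an in-memory list comprehension.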
So, what I'd like to ask is:
- Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka with Apache Spark for batch processing / MapReduce.)
- Is there a better approach to solve the user sessions problem?
Thanks.