Where do Apache Samza and Apache Storm differ in their use cases?

Question

I've stumbled upon this article that purports to contrast Samza with Storm, but it seems only to address implementation details.

Where do these two distributed computation engines differ in their use cases? What kind of job is each tool good for?

score 44 · Answer 1 · answered Aug 05 '15 at 04:29

Well, I've been investigating these systems for a few months, and I don't think they differ profoundly in their use cases. I think it's best to compare them along these lines instead:

Age: Storm is the older project, and the original one in this space, so it's generally more mature and battle-tested. Samza is a newer, second-generation project that seems informed by lessons that were learned from Storm.
Kafka: Samza grew out of the Kafka ecosystem, and is very Kafka-centric. For example, the documentation says that they allow plugging in different messaging systems... as long as they provide similar partitioning, ordering and replay semantics as Kafka does. Storm, being an older system, isn't so specialized to work with Kafka.
Complexity: Samza, partly because it makes stronger assumptions about its environment ("you can have any infrastructure you like as long as it works like Kafka") and partly because it's just newer, strikes me as generally simpler than Storm, in a good way. But one perhaps less good way that Samza is simpler is that it (deliberately?) lacks Storm's concept of topologies (complex execution graphs). If you need a complex, multi-stage processor, it needs to be implemented as independent tasks that communicate through Kafka. This has advantages as well as disadvantages, but Samza makes the choice for you whereas Storm gives you more options.
State management: Many Storm applications need to use an external store like Redis when they need to maintain a large volume of state to process incoming tuples. This situation seems to be one of the main things that motivated Samza's design; one of Samza's most distinctive features is that it provides its tasks with their own local disk-based key/value store to use for this purpose if they need it.
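To make the last two points concrete, here is a minimal in-memory sketch of the Samza-style arrangement: stages that only talk to each other through a Kafka-like partitioned log, and tasks that keep their state in a local store instead of calling out to something like Redis. This is not the Samza API. PartitionedLog, CountingTask and run_stage are hypothetical names made up for illustration, and a plain dict stands in for the local disk-based key/value store.

```python
class PartitionedLog:
    """Hypothetical in-memory stand-in for a Kafka-like topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always maps to the same partition, so per-key
        # ordering is preserved (a simple deterministic hash is assumed).
        index = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class CountingTask:
    """One task instance per partition, with its own local state."""

    def __init__(self):
        # Stand-in for a task-local key/value store; no external
        # database is consulted while processing a message.
        self.store = {}

    def process(self, key, value, output):
        self.store[key] = self.store.get(key, 0) + 1
        # The only way to reach the next stage is to write to a log.
        output.append(key, self.store[key])


def run_stage(source, make_task, sink):
    """Run one independent stage: one fresh task per input partition."""
    for partition in source.partitions:
        task = make_task()
        for key, value in partition:
            task.process(key, value, sink)


# Stage input: a log of page views keyed by user.
page_views = PartitionedLog(2)
for user in ["ann", "bob", "ann", "ann"]:
    page_views.append(user, "view")

# Stage output: another log that a later, separate stage could consume.
counts = PartitionedLog(2)
run_stage(page_views, CountingTask, counts)
emitted = [record for partition in counts.partitions for record in partition]
```

A further stage would be another run_stage call reading from counts, which is the trade-off described above: each stage is decoupled and replaceable, but every hop goes through the log.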

(nb, I'm one of the original Samza developers). This is an excellent and correct summary. Everything that's touched upon here are points I use when people ask me this question. — Jakob Homan, Aug 06 '15 at 21:58

score 22 · Accepted Answer · edited Jun 20 '20 at 09:12

The biggest difference between Apache Storm and Apache Samza comes down to how they stream data to process it.

Apache Storm performs real-time computation using a topology, which is fed into a cluster where the master node distributes the code among worker nodes that execute it. Within a topology, data flows from spouts, which emit data streams as immutable sets of key-value pairs (tuples), to bolts, which process those streams.
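As a rough illustration of the topology idea, the sketch below chains a source (a "spout" in Storm's terms) into two processing steps ("bolts" in Storm's terms) using plain Python generators. It is not Storm's API and nothing here is distributed. The function names are hypothetical, and dicts that are never modified after being emitted stand in for Storm's immutable tuples.

```python
def sentence_spout():
    """Source of the stream: emits one tuple per sentence."""
    for sentence in ["storm processes streams", "samza processes streams"]:
        yield {"sentence": sentence}


def split_bolt(tuples):
    """Processing step: turns each sentence tuple into word tuples."""
    for item in tuples:
        for word in item["sentence"].split():
            # A new tuple is emitted; the incoming one is left untouched.
            yield {"word": word}


def count_bolt(tuples):
    """Processing step: emits a running count for each word seen."""
    counts = {}
    for item in tuples:
        word = item["word"]
        counts[word] = counts.get(word, 0) + 1
        yield {"word": word, "count": counts[word]}


# Wiring the steps together plays the role of defining the topology.
results = list(count_bolt(split_bolt(sentence_spout())))
```

In a real cluster the wiring is declared once and the master node spreads the steps across workers; here the whole graph runs in a single process.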

[Figure: Apache Storm's architecture]

Apache Samza processes messages one at a time as they arrive. Each stream is divided into partitions; each partition is an ordered sequence of messages, and each message has a unique ID within its partition. Samza supports batching and is typically used with Hadoop's YARN and Apache Kafka.
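The partitioning model can be sketched in a few lines. The example below assumes the unique ID is simply the message's position in its partition, and that a consumer remembers the next position to read so it can resume or replay from there. Partition, partition_for and the checkpoint variable are hypothetical names for illustration, not Samza classes.

```python
class Partition:
    """An append-only, ordered sequence of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The position in the sequence doubles as the unique ID.
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read_from(self, offset):
        """Yield (offset, message) pairs in order, starting at offset."""
        for position in range(offset, len(self._messages)):
            yield position, self._messages[position]


def partition_for(key, num_partitions):
    # Simple deterministic hash (an assumption for this sketch).
    return sum(key.encode("utf-8")) % num_partitions


# A stream is just a fixed set of partitions.
stream = [Partition() for _ in range(3)]
for key, payload in [("a", "m1"), ("b", "m2"), ("a", "m3")]:
    stream[partition_for(key, 3)].append((key, payload))

# Process one partition one message at a time, tracking progress.
checkpoint = 0
seen = []
for offset, message in stream[partition_for("a", 3)].read_from(checkpoint):
    seen.append((offset, message))
    checkpoint = offset + 1  # next offset to read after a restart
```

Because every message for key "a" lands in the same partition, the consumer sees them in the order they were written, and restarting from the saved checkpoint skips what was already handled.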

[Figure: Apache Samza's architecture]

Read more about the specifics of how each system executes in the resources below.

USE CASE

Apache Samza was created by LinkedIn.

A software engineer wrote a post stating:

It's been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers. Our largest Samza job is processing over 1,000,000 messages per-second during peak traffic hours.

Resources Used:

Storm vs. Samza Comparison

Useful Architectural References of Storm and Samza

Thanks for your succinct answer! A few questions remain, however: (1) Am I to understand that Samza has no notion of individual streams? That is to say, is all inbound data lumped together regardless of its source? (2) Am I correct in understanding that samza, by virtue of the fact that it's batch-oriented, is good at running multiple tasks on identical input, whereas Storm is more of a "pipeline" or "cascade" with multiple processing steps? Or am I missing your point altogether? Thanks! — Louis Thibault, Mar 25 '15 at 15:26

Grokify · Answer 3 · 2015-03-22T05:47:07.827

Here's an article by Tony Siciliani that provides a use case (and architecture) comparison for Storm, Spark and Samza. Apache.org links to actual use cases are also provided below.

https://tsicilian.wordpress.com/2015/02/16/streaming-big-data-storm-spark-and-samza/

Regarding use cases for Samza and Storm, he writes:

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

Apache Samza

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale…

Samza use case list: https://cwiki.apache.org/confluence/display/SAMZA/Powered+By

Apache Storm

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.

A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel…

Storm use case list: http://storm.apache.org/documentation/Powered-By.html
