I am working on the Spark (Berkeley) cluster computing system. During my research, I learned about some other in-memory systems like Redis, memcachedb, etc. It would be great if someone could give me a comparison between Spark and Redis (and memcachedb). In what scenarios does Spark have an advantage over these other in-memory systems?
1 Answer
They are completely different beasts.
Redis and memcachedb are remote key/value stores. Redis is a pure in-memory system with optional persistence, featuring various data structures. Memcachedb provides a memcached API on top of Berkeley DB. In both cases, they are more likely to be used by OLTP applications, or possibly for simple real-time analytics (on-the-fly aggregation of data).
Both Redis and memcachedb lack mechanisms to efficiently iterate over the stored data in parallel. You cannot easily scan the data and apply some processing to it; they are not designed for this. Also, except by using manual client-side sharding, they cannot be scaled out in a cluster (a Redis Cluster implementation is ongoing, though).
Spark is a system to expedite large-scale analytics jobs (especially iterative ones) by providing in-memory distributed datasets. With Spark, you can implement efficient iterative map/reduce jobs on a cluster of machines.
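For illustration, here is a minimal sketch of such a map/reduce job (word count) using Spark's Scala API; the `input.txt` path and the `local[2]` master are just placeholder values, not anything specific to the question:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for pair RDD operations

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs with two threads for testing; on a real cluster you
    // would pass the cluster's master URL instead.
    val sc = new SparkContext("local[2]", "WordCount")

    val counts = sc.textFile("input.txt")   // distributed dataset (RDD) of lines
      .flatMap(_.split("\\s+"))             // map side: split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // reduce side: sum counts per word

    counts.collect().foreach { case (w, n) => println(w + "\t" + n) }
    sc.stop()
  }
}
```

The point is that the dataset is partitioned across the cluster and each node applies the map/reduce steps to its own partition in parallel, which is exactly what Redis and memcachedb are not built for.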
Redis and Spark both rely on in-memory data management, but Redis (and memcached) plays in the same ballpark as the other OLTP NoSQL stores, while Spark is more similar to a Hadoop map/reduce system.
Redis is good at running numerous fast storage/retrieval operations at high throughput with sub-millisecond latency. Spark shines at implementing large-scale iterative algorithms for machine learning, graph analysis, interactive data mining, etc., on a significant volume of data.
Update: additional question about Storm
The question is to compare Spark to Storm (see comments below).
Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the processing to the data rather than to move the data to the processing. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the processing moves to the data. It is very similar to Hadoop map/reduce, except that memory storage is aggressively used to avoid I/O, which makes it efficient for iterative algorithms (where the output of one step is the input of the next). Shark is only a query engine built on top of Spark (supporting ad-hoc analytical queries).
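A rough sketch of that iterative pattern in Spark's Scala API; the `points.txt` file and the toy gradient-descent loop are made up purely for illustration. The key point is that the dataset is cached once and every iteration re-scans it from memory instead of re-reading it from disk:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions (e.g. sum() on RDD[Double])

object IterativeFit {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "IterativeFit")

    // Parse "x y" lines into (x, y) pairs and pin them in memory.
    val points = sc.textFile("points.txt").map { line =>
      val cols = line.split(" ")
      (cols(0).toDouble, cols(1).toDouble)
    }.cache()

    val n = points.count()  // first action also materializes the cache

    // Toy gradient descent for y ~ w * x: the output of one step (w) is the
    // input of the next, and each step is a parallel pass over the cached data.
    var w = 0.0
    for (i <- 1 to 10) {
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= 0.1 * gradient
    }
    println("Fitted slope: " + w)
    sc.stop()
  }
}
```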
You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine: each node implements a basic process, and data items flow in and out of a network of interconnected nodes (unlike Spark). With Storm, the data moves to the processing.
Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing numerous small data items as they are generated or collected (such as calculating aggregation functions or analytics in real time on a Twitter stream).
Spark operates on a corpus of existing data (like Hadoop) that has been imported into the Spark cluster; it provides fast scanning capabilities thanks to in-memory management, and it minimizes the overall number of I/Os for iterative algorithms.

What about Storm? How can you compare it with Spark (or Shark)? – void May 22 '13 at 09:52
-
Thanks for the update. There's one more question if you don't mind. Spark has a tool, 'Spark Streaming', for real-time analysis. Is it comparable to Storm (another real-time analysis tool)? Are there any advantages to it? – void May 22 '13 at 12:15
-
The Spark Streaming module is comparable to Storm (both are streaming engines), but they work differently. Spark Streaming accumulates batches of data and then submits these batches to the Spark engine as if they were immutable Spark datasets. Storm processes and dispatches items as soon as they are received. I don't know which one is the most efficient in terms of throughput; in terms of latency, it is probably Storm. – Didier Spezia May 22 '13 at 14:44
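For illustration, a minimal Spark Streaming word count in Scala showing this micro-batch model; the `localhost:9999` socket source and the 5-second batch interval are arbitrary example values:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Every 5 seconds, the received lines are turned into a small immutable
    // batch (an RDD) and handed to the Spark engine.
    val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)  // e.g. fed by `nc -lk 9999`
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // counts within the current 5-second batch only

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```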
-
Is there any company using Spark Streaming in production? Is there a stable version released? – void Jun 07 '13 at 07:09
-
I don't think there is a production-grade release yet. – Didier Spezia Jun 10 '13 at 19:30
-
I want to simulate real-time data and process it using Spark Streaming (e.g. word count) and obtain the output in a real-time graph. The graph can be built using HTML, preferably with PHP. Can you guide me on how to do this? I am able to input data through a netcat server and obtain the output through Spark Streaming. Now, the challenge is to simulate real-time data and then plot a graph of the operation. – void Jun 17 '13 at 07:08
-
Stack Overflow is about answering specific programming questions. You will be better served by posting to the spark-users mailing list (which you already did). – Didier Spezia Jun 17 '13 at 09:37
-
Here are my 2 cents: Spark Streaming has the concept of a sliding window, while in Storm you have to maintain the window yourself. – freevictor Feb 14 '14 at 09:37
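For illustration, a sliding-window word count sketched with Spark Streaming's Scala API; the socket source, the 30-second window, and the 10-second slide are all arbitrary example values:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "WindowedWordCount", Seconds(10))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Counts over the last 30 seconds, recomputed every 10 seconds;
    // the framework keeps and expires the window for you.
    val windowedCounts = words.map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```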
-
Batch processing in Storm is done with Trident. Therefore, comparing Trident with Spark Streaming would be more fitting. – Martin Tapp Mar 26 '14 at 12:53