
I am working on a Spark Streaming job that requires storing intermediate results in order to reuse them in the next window of the stream. The amount of data is extremely large, so there is probably no way to store it in the Spark cache. What is more, I need some way to read the data back by a 'key'. I was thinking about Cassandra as intermediate storage, but it also has some drawbacks. Alternatively, maybe Kafka would do the job, but it would require additional work to select a given portion of the data by key.
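To illustrate the pattern (the source and names below are made up for the sketch), this is roughly what Spark's built-in updateStateByKey gives me, except that the state it keeps lives in Spark's memory and checkpoints, which will not scale to my data volume:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CrossWindowState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cross-window-state").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoints") // required by updateStateByKey

    // Hypothetical source emitting "key value" lines over a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(" "))
      .map(a => (a(0), a(1).toLong))

    // Carry a running per-key aggregate from one batch to the next:
    // the "intermediate result" I want to reuse, but kept by Spark itself.
    val state = pairs.updateStateByKey[Long] { (values, prev) =>
      Some(values.sum + prev.getOrElse(0L))
    }
    state.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```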

Could you advise me on what I should do? How are such problems resolved in Storm? Is there any internal mechanism, or is it preferred to use some external tools?

Matthias J. Sax
baju

2 Answers


Solr as an index + Cassandra as NoSQL storage works fine for my use case, where I have to process terabytes of data. In my case, though, I am using Cassandra for persistent storage of years of data.

Kafka works fine as a replacement for JBoss/AMQ due to its simple architecture. I am currently working with Apache Storm + Kafka for real-time stream processing in one of my projects.

Since you are storing intermediate data, I think Kafka is the best choice, provided you set the right retention period.
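As a rough sketch, you could create the intermediate topic with a retention long enough to cover your windows via Kafka's AdminClient. The topic name, partition/replication counts, and the 24-hour retention below are assumptions to tune for your workload:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateIntermediateTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)

    // Hypothetical topic: retain records for 24 hours (86400000 ms)
    // so the next windows can re-read them before Kafka deletes them.
    val topic = new NewTopic("intermediate-results", 6, 3.toShort)
      .configs(Collections.singletonMap("retention.ms", "86400000"))

    admin.createTopics(Collections.singletonList(topic)).all().get()
    admin.close()
  }
}
```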

Have a look at one more SE question and another article.

Ravindra babu

As you mention, Kafka has some problems getting items by key; it really only provides APIs for a FIFO paradigm. I would advise using dedicated storage software, such as Cassandra or MongoDB (I have even seen Solr used to store text). It would be easier to use something designed for key retrieval than to try to modify Kafka yourself and most likely introduce bugs/issues that could take forever to solve.
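For instance, with the DataStax Spark Cassandra Connector, writing each window's keyed results out and reading them back by key is straightforward. The keyspace, table, and schema below are hypothetical, just to sketch the shape of it:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector._ // DataStax Spark Cassandra Connector

// Assumed schema (hypothetical):
//   CREATE TABLE stream_ks.intermediate (key text PRIMARY KEY, value text);
object IntermediateStore {
  // Persist one window's (key, value) results for reuse in later windows.
  def persistWindow(results: DStream[(String, String)]): Unit =
    results.foreachRDD { rdd =>
      rdd.saveToCassandra("stream_ks", "intermediate", SomeColumns("key", "value"))
    }

  // Read back only the rows for a given key; the predicate is pushed
  // down to Cassandra, so the full data set is never scanned.
  def lookupByKey(sc: SparkContext, key: String) =
    sc.cassandraTable("stream_ks", "intermediate").where("key = ?", key)
}
```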

As SQL.injection said, you'll have to manage the storage and logic by yourself. Storm doesn't offer such a mechanism.

Morgan Kenyon