I don't have a specific query, just a design question. I am new to Spark/streaming, so forgive me if this is a dumb question, and please delete it if it is inappropriate for this forum.
So basically we have a requirement to process a huge amount of data every hour and produce output for reporting in Kibana (Elasticsearch). Suppose we have the two data models shown below. DataModel-1 represents a hash tag and the user IDs of the people who tweeted with that tag; it is stream data and we get roughly 40K events per second. DataModel-2 contains a zip code and the users who are in that zip; it doesn't change very often. In the output we need data from which we can see the trend of a tag for a given zip, i.e. in a given time window, how many users in a zip are tweeting with a given tag.
I have the following questions:
- Can we use Spark Streaming with Kafka? My concern is whether we will be able to scale to 40K events per second. We have started a POC, so we will find out eventually, but I wanted to hear about others' experience and any tuning we can apply to achieve it. (A rough sketch of the ingestion I have in mind is shown after this list.)
- If I go with batch processing instead, say every hour, what would be a good data store where I can save the tweets and process them later? Would Oracle or MySQL be good for storing the data and then loading it into Spark, or should I just dump it into HDFS?
- What would be a good reporting platform apart from Kibana?
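
To make the first question concrete, here is a minimal sketch of the ingestion side I am picturing, using PySpark Structured Streaming. The broker address, topic name, and JSON schema are my assumptions, not something we have settled on:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import StructType, StringType, ArrayType, LongType

spark = SparkSession.builder.appName("tag-trend-poc").getOrCreate()

# Schema matching DataModel-1: one hash tag plus the users who tweeted with it
tag_schema = (StructType()
              .add("hash", StringType())
              .add("users", ArrayType(LongType())))

# Read the tag events from Kafka; broker and topic names are placeholders
tag_stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "tag-events")
              .load()
              # keep the Kafka message timestamp so we can window on it later
              .select(col("timestamp"),
                      from_json(col("value").cast("string"), tag_schema).alias("e"))
              # flatten to one row per (hash, user)
              .select("timestamp",
                      col("e.hash").alias("hash"),
                      explode(col("e.users")).alias("user")))
```

The idea is that Kafka partitions plus Spark executors would let us scale horizontally, but I have no feel yet for how many partitions/cores 40K events per second would actually need.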
DataModel-1: [{ hash: #IAMHAPPY, users: [123, 134, 4566, 78899] }]
DataModel-2: [{ zip: zip1, users: [123, 134] }, { zip: zip2, users: [4566, 78899] }]
Report Data Model: [{ zip: zip1, hash: [#IAMHAPPY] }, { zip: zip2, hash: [#IAMHAPPY] }]
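
For producing the report, I am imagining something like the stream-static join below (continuing the sketch above). The file path for DataModel-2 and the console sink are placeholders; the real job would load the zip mapping from wherever it lives and write to Elasticsearch:

```python
from pyspark.sql.functions import col, explode, window, approx_count_distinct

# DataModel-2 as a small static DataFrame: one row per (zip, user).
# Path and format are assumptions for this sketch.
zip_users = (spark.read.json("/data/zip_users.json")
             .select(col("zip"), explode(col("users")).alias("user")))

# Join the tag stream with the zip mapping, then count distinct users
# per (zip, hash) in hourly windows based on the Kafka timestamp
report = (tag_stream
          .join(zip_users, "user")
          .groupBy(window(col("timestamp"), "1 hour"), col("zip"), col("hash"))
          .agg(approx_count_distinct("user").alias("user_count")))

# Console sink just for the sketch; the real pipeline would use an
# Elasticsearch sink so Kibana can pick the results up
query = (report.writeStream
         .outputMode("update")
         .format("console")
         .start())
```

Does this general shape (streaming join against a small static table, hourly windowed aggregation) look reasonable for the volume described, or would you structure it differently?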