I don't have a specific query, just a design question. I am new to Spark/streaming, so forgive me if this is a dumb question, and please delete it if it is inappropriate for this forum.
So basically we have a requirement to process a huge amount of data every hour and produce output for reporting in Kibana (Elasticsearch). Suppose we have the two data models shown below. DataModel-1 represents a hash tag and the user IDs of the people who tweeted with that tag; it is stream data and we get roughly 40K events per second. DataModel-2 contains a zip code and the users who are in that zip; it doesn't change very often. In the output we need data from which we can see the trend of a tag for a given zip, i.e. in a given time window, how many users in a zip are tweeting with a given tag.
I have the following questions:
- Can we use Spark Streaming with Kafka? My concern is whether we will be able to scale to 40K events per second. We have started a POC, so we will find out eventually, but I wanted to hear about others' experience and any tuning we can apply to achieve it. (A rough sketch of the ingestion I have in mind is shown after this list.)
- If I go with batch processing instead, say every hour, what would be a good data store where I can save the tweets and process them later? Would Oracle or MySQL be good for storing the data and then loading it into Spark, or should I just dump it into HDFS?
- What would be a good reporting platform apart from Kibana?
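
To make the first question concrete, here is a minimal sketch of the ingestion side I am picturing, using PySpark Structured Streaming. The broker address, topic name, and JSON schema are my assumptions, not something we have settled on:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import StructType, StringType, ArrayType, LongType

spark = SparkSession.builder.appName("tag-trend-poc").getOrCreate()

# Schema matching DataModel-1: one hash tag plus the users who tweeted with it
tag_schema = (StructType()
              .add("hash", StringType())
              .add("users", ArrayType(LongType())))

# Read the tag events from Kafka; broker and topic names are placeholders
tag_stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "tag-events")
              .load()
              # keep the Kafka message timestamp so we can window on it later
              .select(col("timestamp"),
                      from_json(col("value").cast("string"), tag_schema).alias("e"))
              # flatten to one row per (hash, user)
              .select("timestamp",
                      col("e.hash").alias("hash"),
                      explode(col("e.users")).alias("user")))
```

The idea is that Kafka partitions plus Spark executors would let us scale horizontally, but I have no feel yet for how many partitions/cores 40K events per second would actually need.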
DataModel-1: [{ hash: #IAMHAPPY, users: [123, 134, 4566, 78899] }]
DataModel-2: [{ zip: zip1, users: [123, 134] }, { zip: zip2, users: [4566, 78899] }]
Report Data Model: [{ zip: zip1, hash: [#IAMHAPPY] }, { zip: zip2, hash: [#IAMHAPPY] }]
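
For producing the report, I am imagining something like the stream-static join below (continuing the sketch above). The file path for DataModel-2 and the console sink are placeholders; the real job would load the zip mapping from wherever it lives and write to Elasticsearch:

```python
from pyspark.sql.functions import col, explode, window, approx_count_distinct

# DataModel-2 as a small static DataFrame: one row per (zip, user).
# Path and format are assumptions for this sketch.
zip_users = (spark.read.json("/data/zip_users.json")
             .select(col("zip"), explode(col("users")).alias("user")))

# Join the tag stream with the zip mapping, then count distinct users
# per (zip, hash) in hourly windows based on the Kafka timestamp
report = (tag_stream
          .join(zip_users, "user")
          .groupBy(window(col("timestamp"), "1 hour"), col("zip"), col("hash"))
          .agg(approx_count_distinct("user").alias("user_count")))

# Console sink just for the sketch; the real pipeline would use an
# Elasticsearch sink so Kibana can pick the results up
query = (report.writeStream
         .outputMode("update")
         .format("console")
         .start())
```

Does this general shape (streaming join against a small static table, hourly windowed aggregation) look reasonable for the volume described, or would you structure it differently?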