
I'm involved in a project with two phases and I'm wondering whether it counts as a big data project (I'm a newbie in this field).

In the first phase I have this scenario:

  • I have to collect a huge amount of data
  • I need to store that data
  • I need to build a web application that shows the data to the users

In the second phase I need to analyze the stored data, build reports, and do some analysis on them.

Some example of data quantity: in one day I may need to collect and store around 86,400,000 records (that is, roughly 1,000 records per second).

Now I was thinking of this kind of architecture:

  • to collect the data, some asynchronous technology such as ActiveMQ with the MQTT protocol
  • to store the data, a NoSQL DB (MongoDB, HBase or another one); a rough sketch of this ingest path is just below
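
To give an idea of what I mean, here is a minimal sketch of the ingest path I have in mind. It assumes the Eclipse Paho MQTT client and the MongoDB Java driver; the broker URL, topic and collection names are only placeholders, not a final design:

    // Minimal sketch: consume records over MQTT and persist them to MongoDB.
    // Broker URL, database, collection and topic names below are placeholders.
    import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
    import org.eclipse.paho.client.mqttv3.MqttCallback;
    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttMessage;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class SensorIngest {
        public static void main(String[] args) throws Exception {
            MongoCollection<Document> records = MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("telemetry")
                    .getCollection("records");

            MqttClient client = new MqttClient("tcp://localhost:1883", "ingest-1");
            client.setCallback(new MqttCallback() {
                @Override
                public void connectionLost(Throwable cause) { /* reconnect logic would go here */ }

                @Override
                public void messageArrived(String topic, MqttMessage message) {
                    // One document per incoming message; in practice inserts would be batched.
                    records.insertOne(new Document("topic", topic)
                            .append("payload", new String(message.getPayload()))
                            .append("receivedAt", System.currentTimeMillis()));
                }

                @Override
                public void deliveryComplete(IMqttDeliveryToken token) { }
            });
            client.connect();
            client.subscribe("sensors/#");
        }
    }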

Now this would solve my first-phase problems.

But what about the second phase?

I was thinking about some big data software (like Hadoop or Spark) and some machine learning software, so that I can retrieve the data from the DB, analyze it, and reshape or store it in a better way in order to build good reports and do some specific analysis.
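
For phase two I imagine something like the batch job sketched below. It is only a rough Spark example in Java: it assumes the raw records have already been exported to Parquet on HDFS, and the paths and column names (deviceId, value, ts) are invented for the example rather than my real schema:

    // Rough sketch of a phase-two batch job: read the stored records with Spark,
    // aggregate them per device and day, and write a report table back out.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.*;

    public class DailyReportJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("daily-report")
                    .getOrCreate();

            // Placeholder input path; assumes records were exported to Parquet on HDFS.
            Dataset<Row> records = spark.read().parquet("hdfs:///telemetry/records");

            Dataset<Row> report = records
                    .withColumn("day", to_date(col("ts")))
                    .groupBy(col("deviceId"), col("day"))
                    .agg(count("*").alias("samples"),
                         avg("value").alias("avgValue"),
                         max("value").alias("maxValue"));

            // Placeholder output path for the report data the web application could read.
            report.write().mode("overwrite").parquet("hdfs:///telemetry/reports/daily");
        }
    }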

I was wondering if this is the best approach

How would you solve this kind of scenario? Am I on the right track?

Thank you

Angelo


2 Answers


As answered by siddhartha, whether your project can be tagged as a big data project or not depends on the context and business domain/case of your project.

Coming to the tech stack, each of the technologies you mentioned has a specific purpose. For example, if you have structured data, you can use any modern database with query support. NoSQL databases come in different flavours (columnar, document-based, key-value, etc.), so the technology choice again depends on the kind of data and use case you have. I suggest you do some POCs and analysis of the technologies before making the final call.
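
To make the "flavours" point concrete, here is a tiny illustration (not a recommendation of either product) of how the same reading could be modelled for a document store versus a wide-column store; every name in it is invented for the example:

    // Illustrative only: the same reading shaped for two NoSQL flavours.
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.bson.Document;

    public class ModelComparison {
        public static void main(String[] args) {
            // Document-oriented (e.g. MongoDB): one self-describing document per record.
            Document doc = new Document("deviceId", "sensor-42")
                    .append("ts", 1468656000000L)
                    .append("value", 21.7);

            // Wide-column (e.g. HBase): the row key encodes device + timestamp so
            // per-device time-range scans stay cheap; the value goes into a column family.
            Put put = new Put(Bytes.toBytes("sensor-42#1468656000000"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(21.7));

            System.out.println(doc.toJson());
            System.out.println(put);
        }
    }

Which of the two fits better depends on how you will query the data, which is exactly what a small POC should tell you.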

Puneet Khatod

The definition of big data varies from user to user. For Google, 100 TB might be small data, but for me it is big data because of the difference in available commodity hardware. For example, Google can have a cluster of 50,000 nodes, each node with 64 GB of RAM, to analyse 100 TB of data, so for them that is not big data. But I cannot have a cluster of 50,000 nodes, so for me it is big data.

The same goes for your case: if you have commodity hardware available, you can go ahead with Hadoop. As you have not mentioned the size of the files you are generating each day, I cannot be certain about your case, but Hadoop is always a good choice for processing your data because of newer projects like Spark, which can help you process the data in much less time and also gives you real-time analysis features. So in my opinion it is better if you use Spark or Hadoop, because then you can play with your data. Moreover, since you want to use a NoSQL database, you can use HBase, which comes with the Hadoop ecosystem, to store your data.
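
For example, writing a record into HBase from Java takes only a few lines. This is just a sketch: it assumes a reachable cluster, a pre-created 'records' table and a 'd' column family, all of which are example names:

    // Minimal sketch of storing one incoming record in HBase from Java.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseWriter {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("records"))) {
                // Row key = device id + timestamp, so per-device time-range scans stay cheap.
                Put put = new Put(Bytes.toBytes("sensor-42#" + System.currentTimeMillis()));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(21.7));
                table.put(put);
            }
        }
    }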

Hope this answers your question.

siddhartha jain
  • well it's exactly what I'm thinking... but I'm wondering: is ActiveMQ+MQTT good enough to collect a huge amount of data (86 million records a day means around 1,000 records a second)? I was thinking of using HBase+Hadoop+Hive+Mahout (with Samsara) and I think I'm on the right track.... – Angelo Immediata Jul 16 '16 at 06:28