First of all, this is not a question asking for step-by-step help deploying the components below. What I am asking for is advice on how the architecture should be designed. I am planning to develop a reporting platform on top of existing data. The following is what I have gathered from my research.
I have an existing RDBMS with a large number of records, so I am planning to use:
- Sqoop - to extract data from the RDBMS into Hadoop
- Hadoop - as the storage platform
- Hive - as the data warehouse
- Spark - since Hive is geared towards batch processing, running Spark on top of Hive should speed things up (see the rough sketch after this list)
- JasperReports - to generate reports
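For context, here is how I currently imagine the Spark-on-Hive step, assuming a Hive table (here called `orders`) that Sqoop has already populated; the table, columns, and query are placeholders I made up, not part of any real schema:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReportQuery {
    public static void main(String[] args) {
        // Spark session with Hive support, so Spark can read tables
        // registered in the Hive metastore.
        SparkSession spark = SparkSession.builder()
                .appName("reporting-poc")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder aggregation over a Hive table that Sqoop would
        // have imported from the RDBMS (names are assumptions).
        Dataset<Row> summary = spark.sql(
                "SELECT customer_id, SUM(amount) AS total "
              + "FROM orders GROUP BY customer_id");

        summary.show();
        spark.stop();
    }
}
```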
What I have done up to now is deploy a Hadoop 2 cluster as follows:
- 192.168.X.A - Namenode
- 192.168.X.B - 2nd Namenode
- 192.168.X.C - Slave1
- 192.168.X.D - Slave2
- 192.168.X.E - Slave3
My questions are:
- On which node should I deploy Spark, A or B, given that I want to support failover? That is why I have a second namenode configured on B.
- Should I deploy Spark on every instance? Which nodes should the workers be?
- On which node should I deploy Hive? Is there a better alternative to Hive?
- How should I connect JasperReports, and to which component should it connect, Hive or Spark? (I have put a rough connectivity sketch after this list.)
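For the last point, my current guess is that JasperReports would connect over JDBC, either to HiveServer2 or to the Spark Thrift Server, using the Hive JDBC driver; the same driver class and URL would then go into the JasperReports JDBC data source. A minimal connectivity test I am considering looks like this (the host, port, credentials, and table name are assumptions on my part):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver, also usable against the Spark Thrift Server.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host and port are placeholders for
        // whichever node ends up running it.
        String url = "jdbc:hive2://192.168.X.A:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```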
Please suggest a suitable way to design this architecture, and please provide an elaborated answer.
If you can point me to any technical guides or case studies of a similar nature, that would be really helpful.