First of all, this is not a question asking for step-by-step help deploying the components below. What I'm asking for is advice on how the architecture should be designed. I'm planning to develop a reporting platform using existing data. The following is what I have gathered from my research.

I have an existing RDBMS that holds a large number of records, so I'm using:

  • Sqoop - Extract data from the RDBMS to Hadoop
  • Hadoop - Storage platform
  • Hive - Data warehouse
  • Spark - Since Hive is more like batch processing, Spark on Hive will speed up things
  • JasperReports - To generate reports.

What I have done up to now is deploy a Hadoop 2 cluster as follows:

  • 192.168.X.A - Namenode
  • 192.168.X.B - 2nd Namenode
  • 192.168.X.C - Slave1
  • 192.168.X.D - Slave2
  • 192.168.X.E - Slave3

My problems are

  • On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B.
  • Should I deploy Spark on each and every instance? Which nodes should be the workers?
  • In which node should I deploy Hive? Is there a better alternative to Hive?
  • How should I connect JasperReports? And to where? To Hive or Spark?

Please suggest a suitable way to design the architecture, and please provide an elaborate answer.

Note that if you can provide any technical guides or case studies of a similar nature, it would be really helpful.

Techie

1 Answer

You've figured it out already! All my answers are merely general opinions and might change drastically depending on the data and the kinds of operations to be performed. The question also implies that the data and the results of such operations are mission critical, so I have assumed that.

Spark on Hive will speed up things

Not necessarily correct. Anecdotal evidence, such as this post (by Cloudera), suggests quite the opposite. There is actually a move in the reverse direction, i.e. Hive on Spark.

On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B. Should I deploy Spark on each and every instance? Which nodes should be the workers?

Definitely, in most cases anyway. Set A or B as the master; all of the rest can be worker nodes. If you don't want a single point of failure (SPOF) in your architecture, see the high availability section of the Spark documentation; it requires a bit of extra work.
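
As a rough illustration of what that high availability setup involves, here is a minimal sketch in Python that only assembles the settings for Spark's standalone ZooKeeper recovery mode. The host names `node-a` and `node-b` are hypothetical stand-ins for nodes A and B, and the ZooKeeper ensemble is an assumption, since the question does not mention one.

```python
# Hypothetical host names standing in for nodes A and B from the question.
MASTERS = ["node-a", "node-b"]
# Assumed ZooKeeper ensemble; the question does not mention one.
ZOOKEEPER = ["zk1:2181", "zk2:2181", "zk3:2181"]


def master_url(hosts, port=7077):
    """Build the multi-master URL that workers and applications connect to."""
    return "spark://" + ",".join(f"{h}:{port}" for h in hosts)


def daemon_java_opts(zk_hosts, zk_dir="/spark"):
    """Build a SPARK_DAEMON_JAVA_OPTS value enabling ZooKeeper recovery."""
    opts = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zk_hosts),
        "spark.deploy.zookeeper.dir": zk_dir,
    }
    return " ".join(f"-D{key}={value}" for key, value in opts.items())


print(master_url(MASTERS))
print(daemon_java_opts(ZOOKEEPER))
```

The second string would go into `SPARK_DAEMON_JAVA_OPTS` in `spark-env.sh` on both master nodes, and the first is what workers and applications would use as the master address, so that they can reach whichever master is currently the leader.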

Is there a better alternative to Hive?

This one is both subjective and task-specific. If SQL querying feels natural and fits the task, there is also Impala, promoted by Cloudera, which claims to perform an order of magnitude faster than Hive, but it is something of a stranger in the Apache Hadoop ecosystem. With Spark, if you are fine typing a bit of Python or Scala, you can do SQL-like querying while still enjoying the expressive power these languages provide.
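
To give a feel for that mix of SQL and a general-purpose language, here is a small sketch in Python. It assumes a `spark` object that exposes a `sql()` method (as a PySpark `SparkSession` or `HiveContext` does); the table name `sales.orders` and the columns `order_date` and `amount` are made up for illustration and are not from the question.

```python
def monthly_totals_sql(table, year):
    """Return a SQL query for per-month totals in one year.

    The column names order_date and amount are hypothetical.
    """
    # Reject anything that is not a plain (optionally db-qualified) name.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        "SELECT month(order_date) AS month, sum(amount) AS total "
        f"FROM {table} "
        f"WHERE year(order_date) = {int(year)} "
        "GROUP BY month(order_date)"
    )


def top_months(spark, table, year, n=3):
    """Run the query through Spark, then rank the months in plain Python."""
    rows = spark.sql(monthly_totals_sql(table, year)).collect()
    ranked = sorted(rows, key=lambda row: row["total"], reverse=True)
    return [(row["month"], row["total"]) for row in ranked[:n]]
```

With a real session this might be called as `top_months(spark, "sales.orders", 2015)`. The aggregation runs as SQL inside Spark, while the ranking of the (small) result happens in ordinary Python.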

How should I connect JasperReports? And to where? To Hive or Spark?

Don't know about this one.

mehmetminanc
  • +1 for the answer. I have a few follow-up questions. 1. What if I have configured Spark on node A and it goes down? Hadoop will keep working since it has B (2nd namenode). What will happen to Spark? 2. Do you have any guide for deploying a Spark cluster with Hive on top of it? – Techie Nov 11 '15 at 03:13
  • Actually, both of these are answered in the links given in the post. [See the high availability section of this Spark document](https://spark.apache.org/docs/1.5.1/spark-standalone.html). [These](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) [two](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark) pages should give you an idea about Hive on Spark. Please also look at Cloudera CDH and Apache Ambari; these are cluster management platforms and take a lot of the burden off your shoulders. – mehmetminanc Nov 11 '15 at 06:29
  • @Techie, hi, we are currently looking for an architecture, and over the past 5 years the EDW that you built has probably matured. Based on your experience, would you recommend the setup you mentioned in the question, or are there other alternatives to go with? Thank you. – kayhan yüksel Aug 18 '21 at 15:03