Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

Question

I am new to all of these technologies and have been given some time to understand them, but I am still confused about a few things. Please correct me if I am wrong.

Nutch: It is a web crawler. We can use it to crawl web pages and store them somewhere in a database.

Solr: Solr can be used to index the web pages crawled by Apache Nutch. It helps in searching the indexed web pages.

HBase: It is used as an interface to interact with Hadoop. It helps in getting data from HDFS in real time. It provides a simple SQL-like interface for interaction.

Hadoop: It provides two pieces of functionality: one is HDFS (Hadoop Distributed File System) and the other is MapReduce, which is modeled on Google's MapReduce. It is basically used for offline data backup, etc.

Gora and ZooKeeper: I am not sure about these.

Confusions: 1). Is HBase a key-value database or just an interface to Hadoop? Or, put another way, can HBase exist without Hadoop? If yes, can you explain a bit more about its usage?

2). Is there any use in crawling data with Apache Nutch without indexing it into Solr?

3). Do we need HBase and Hadoop to run Apache Nutch? If not, how can we make it work without them?

4). Is Hadoop part of HBase?

score 0 · Answer 1 · edited May 23 '17 at 11:45

Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS

Because HBase is built on top of Hadoop, you can't really have HBase without Hadoop.
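On the "is it a key-value DB" part of the question: HBase follows the Bigtable data model, which is roughly a sorted map from row key to column to timestamped versions of a value. The sketch below is a toy in-memory model of that idea in plain Python, not the HBase client API; `ToyTable`, its methods, and the sample row keys and column names are made up for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy stand-in for an HBase-style table (hypothetical, not the real API).

    Layout: row key -> column ("family:qualifier") -> {timestamp: value}.
    """

    def __init__(self):
        # cells[row_key][column] maps a timestamp to the value written then.
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell."""
        self.cells[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if never written."""
        versions = self.cells.get(row_key, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order within [start_row, stop_row)."""
        return [key for key in sorted(self.cells) if start_row <= key < stop_row]


# Assumed sample data: two versions of one page and a single version of another.
table = ToyTable()
table.put("com.example/index", "f:content", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "f:content", "<html>v2</html>", timestamp=2)
table.put("org.apache/nutch", "f:content", "<html>nutch</html>", timestamp=1)

latest = table.get("com.example/index", "f:content")
com_rows = table.scan("com", "d")
```

Lookups by row key and ordered range scans are the access pattern this model is meant to show; in a real deployment the data would be persisted in files on HDFS rather than in a Python dictionary.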

Yes, you can run Nutch without Solr; however, there do not seem to be many use cases for it, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there do not seem to be many real-world examples of people doing this.
Yes, Hadoop is part of HBase in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
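As for those other uses, the MapReduce side of Hadoop mentioned in the question is a batch-processing model with a map phase, a grouping (shuffle) step, and a reduce phase. Here is a minimal single-process sketch of that pattern as a word count in plain Python. It does not use Hadoop at all; the function names and sample pages are invented for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Assumed sample input standing in for crawled page text.
pages = ["nutch crawls pages", "solr indexes pages"]
counts = word_count(pages)
```

On a real cluster the map and reduce functions would run in parallel on many machines over data stored in HDFS; the structure of the computation is the same.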

ZooKeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora is an in-memory data model and persistence framework for big data, with Hadoop MapReduce support.
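To make the Gora part more concrete: in Nutch 2.x, Gora sits between the crawler and the storage backend, so the crawler writes through one interface and the backend (HBase, for example) can be swapped. The following is a rough Python sketch of that design idea only. Gora itself is a Java framework, and the `DataStore`, `InMemoryStore`, and `record_fetch` names here are hypothetical stand-ins, not its API.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical storage interface the crawler codes against."""

    @abstractmethod
    def put(self, key, record):
        """Persist a record under a key."""

    @abstractmethod
    def get(self, key):
        """Return the record stored under a key, or None."""


class InMemoryStore(DataStore):
    """One possible backend; an HBase-backed class could replace it."""

    def __init__(self):
        self._rows = {}

    def put(self, key, record):
        # Copy the record so later changes by the caller do not leak in.
        self._rows[key] = dict(record)

    def get(self, key):
        return self._rows.get(key)


def record_fetch(store, url, status):
    """Crawler-side code: it only knows about the DataStore interface."""
    store.put(url, {"status": status})


store = InMemoryStore()
record_fetch(store, "http://example.com/", "fetched")
```

Because `record_fetch` depends only on the interface, changing the backend means constructing a different `DataStore` subclass and leaving the crawler code alone.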

Thanks, Mike. That was a good explanation, and it is the answer I expected. Can you please point me to some tutorials that describe the internal workings of the Nutch and HBase integration? I am facing some issues. Here is one of them: http://stackoverflow.com/questions/29292977/apache-nutch-solr-and-hbase-integration-issues-on-mac I would also like to know how HBase interacts with Hadoop and works internally. — user3089214, Mar 27 '15 at 12:30
