
I am somewhat new to Apache Hadoop. I have seen this and this question about Hadoop, HBase, Pig, Hive, and HDFS. Both of them describe comparisons between the above technologies.

But I have seen that a Hadoop environment typically contains all of these components (HDFS, HBase, Pig, Hive, Azkaban).

Can someone explain the relationship between these components/technologies and their responsibilities within a Hadoop environment, in an architectural workflow manner, preferably with an example?

Supun Wijerathne

2 Answers


General Overview:

HDFS is Hadoop's Distributed File System. Intuitively you can think of this as a filesystem that spans many servers.
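
To make this concrete, here is a minimal Java sketch using Hadoop's standard org.apache.hadoop.fs API. The NameNode address and file path are placeholders for illustration, not anything from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode-host" is a placeholder for your cluster's NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        // Writing a file looks like a local write, but HDFS splits it into
        // blocks and replicates them across DataNodes behind the scenes.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("exists: " + fs.exists(path));
    }
}
```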

HBase is a column-oriented datastore. It is modeled after Google's Bigtable, but if that's not something you know about, think of it as a non-relational database that provides real-time read/write access to data. It is integrated into Hadoop.
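
As a rough sketch of that real-time read access, here is what a single-row lookup looks like with the HBase Java client (1.x-era API). The users table, info column family, and row key are made-up names, not anything the question defines:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Fetch one row by its key; HBase serves this directly,
            // without launching a MapReduce job.
            Get get = new Get(Bytes.toBytes("row-key-123"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"),
                                           Bytes.toBytes("email"));
            System.out.println(email == null ? "not found"
                                             : Bytes.toString(email));
        }
    }
}
```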

Pig and Hive are ways of querying data in the Hadoop ecosystem. The main difference is that Hive is more SQL-like than Pig; Pig uses a language called Pig Latin.
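
To show the "Hive is more like SQL" point, here is a hedged sketch that runs a HiveQL query through HiveServer2's JDBC interface from Java. The host and the page_views table are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // "hiveserver-host" and "page_views" are placeholders; this needs
        // the hive-jdbc driver on the classpath.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Plain SQL-looking HiveQL; under the hood Hive compiles this
             // into jobs that run over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT user_id, COUNT(*) AS views "
                 + "FROM page_views GROUP BY user_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```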

Azkaban is a prison... I mean, a batch workflow job scheduler. It's similar to Oozie in that you can run MapReduce, Pig, Hive, bash, etc. as a single job.

At the highest level possible, you can think of HDFS as your filesystem, with HBase as the datastore. Pig and Hive would be your means of querying the datastore. Azkaban would then be your way of scheduling jobs.

Stretched Example:

Suppose you're familiar with Linux ext3 or ext4 as a filesystem, MySQL/PostgreSQL/MariaDB/etc. as a database, SQL to access the data, and cron to schedule jobs. (On Windows, you can swap ext3/ext4 for NTFS and cron for Task Scheduler.)

HDFS takes the place of ext3 or ext4 (and is distributed), HBase takes the database role (and is non-relational!), Pig/Hive are a way to access the data, and Azkaban is a way to schedule jobs.

NOTE: This is not an apples-to-apples comparison. It's just to demonstrate that the Hadoop components are an abstraction meant to give you a workflow you're probably already familiar with.

I highly encourage you to look into the components further, as you'll have a good amount of fun. Hadoop has so many interchangeable components (YARN, Kafka, Oozie, Ambari, ZooKeeper, Sqoop, Spark, etc.) that you'll be asking yourself this question a lot.

EDIT: The links you posted went into more detail about HBase and Hive/Pig, so I tried to give an intuitive picture of how they all fit together.

K.Boyette
  • Are these correct as I understood? 1) Hive/Pig both serve the same purpose (data access, though they differ in use); if you go with one, the other is optional. 2) HBase is built on top of HDFS. – Supun Wijerathne Jun 04 '16 at 14:12
  • Yeah, you can look at it like that. On the surface, Hive and Pig both provide a means of doing the same thing. They were initially developed by two different groups, so the philosophy and use case for each is slightly different. Since Hive is more like SQL (HiveQL), it should work well with structured data. Pig is pretty good for semi-structured data, I would imagine. I will note that I am not an expert, so this is just a thought/opinion from what I've learned. – K.Boyette Jun 06 '16 at 12:58
  • As for HBase, I don't really have experience with it, so I can't tell you for sure, but I found this link that might help: http://thenewstack.io/a-look-at-hbase/ – K.Boyette Jun 06 '16 at 12:58
  • I hope that this answers your question. – K.Boyette Jun 06 '16 at 12:59

A Hadoop environment contains all these components (HDFS, HBase, Pig, Hive, Azkaban). A short description of each:

HDFS - the storage layer in the Hadoop framework.

HBase - a columnar database, where you store data in the form of columns for faster access. Yes, it does use HDFS as its storage.

Pig - a data flow language; its community has provided built-in functions to load and process semi-structured data like JSON and XML, along with structured data.

Hive - a querying language for running queries over tables; mounting a table over the HDFS data is necessary before you can work with it.
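
To illustrate that mounting step, here is a minimal Java/JDBC sketch that defines an EXTERNAL Hive table over files already sitting in HDFS. The host, path, and schema are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveMountExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host; needs the hive-jdbc driver on the classpath.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver-host:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // EXTERNAL means Hive only "mounts" a table over files that are
            // already in HDFS; dropping the table leaves the files alone.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS page_views ("
                + " user_id STRING, url STRING, ts BIGINT)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                + " STORED AS TEXTFILE"
                + " LOCATION '/data/page_views'");
        }
    }
}
```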

Azkaban - if you have a pipeline of Hadoop jobs, you can schedule them to run at specific times and after (or before) some dependency.
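
As a sketch of what such a pipeline looks like in Azkaban's own .job property format (the job names and commands below are hypothetical), aggregate runs only after fetch_data succeeds:

```
# fetch_data.job -- hypothetical first step of the pipeline
type=command
command=sh fetch_data.sh

# aggregate.job -- hypothetical second step; runs only after fetch_data
type=command
dependencies=fetch_data
command=pig -f aggregate.pig
```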

sumitya
  • If I ask like this: can you please indicate the workflow of the above components? I am going to develop a set of data-retrieval APIs using Java. Each API calls HBase to get data given a row_key. Can you please indicate the workflow/interconnection of the above components between "calling HBase using a row_key" and "getting the set of data"? – Supun Wijerathne Jun 05 '16 at 03:06