3

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Has this got something to do with underlying data structures & algorithms?

redeemed
    Hadoop is NOT a replacement for RDBMS. I suggest that you stop listening to hearsay and read about both the technologies to know the reality. – toddlermenot Sep 12 '15 at 12:25

4 Answers

6

RDBMSs have challenges handling huge data volumes of terabytes and petabytes. Even with Redundant Array of Independent/Inexpensive Disks (RAID) and data sharding, they do not scale well for huge volumes of data, and they require very expensive hardware.

EDIT: To answer why RDBMS cannot scale, have a look at the overheads of an RDBMS.

Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).

Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.

Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.

Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.

How does Hadoop handle this?

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing and retrieving huge volumes of data.

This scalability and efficiency are possible because of Hadoop's storage mechanism (HDFS) and its way of running processing jobs (YARN MapReduce jobs). Apart from scalability, Hadoop provides high availability of the stored data.

Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured, and semi-structured) are the keys to the success of Hadoop.

Data is stored on thousands of nodes, and processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key reason for Hadoop's success.

This has been achieved with the NameNode, DataNodes, and the ResourceManager.

To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation.
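To make the data-locality point concrete, here is a minimal MapReduce sketch (the classic word-count pattern; the input and output paths are placeholders you would pass on the command line, not anything from the original answer). Each mapper is scheduled, where possible, on the DataNode that holds its input split, so only the small intermediate (word, count) pairs travel over the network:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs on (ideally) the DataNode holding the input block: reads local data,
  // emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Only the shuffled (word, 1) pairs cross the network; the reducer sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the same mapper class runs in parallel on every node that holds a block of the input, adding more DataNodes adds both storage and compute, which is exactly where the horizontal scalability comes from.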

Still, an RDBMS is good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes and petabytes of data. NoSQL databases, which provide two of the Consistency, Availability, and Partition-tolerance attributes of the CAP theorem, are good in some use cases.

But Hadoop is not meant for real-time transaction support with ACID properties. It is good for business-intelligence reporting with batch processing, following the "write once, read many" paradigm.
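A minimal sketch of that "write once, read many" usage with the HDFS Java API (the file path here is hypothetical, and the NameNode address is assumed to come from the cluster's core-site.xml):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from the configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/events/2015-09-12.log"); // hypothetical path

    // Write once: the file is streamed in and then treated as effectively immutable.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: batch jobs (MapReduce, Hive, etc.) re-scan the same file repeatedly.
    try (FSDataInputStream in = fs.open(path)) {
      byte[] buffer = new byte[4096];
      int read;
      while ((read = in.read(buffer)) > 0) {
        System.out.write(buffer, 0, read);
      }
    }
  }
}
```

There is no cheap in-place update of individual records here, which is why the model suits append-and-scan reporting workloads rather than ACID transactions.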

From slideshare.net RDBMS vs Hadoop

Have a look at one more related SE question :

NoSql vs Relational database

Ravindra babu
  • I agree, data locality is a key feature in Hadoop wherein code moves to where the data is, and the data doesn't flow over the network to be processed. The point you've mentioned above regarding RAID, does it concern the clustering capability in RDBMS? Pardon me, I am not a database guy. If it's a way to achieve clustering in the RDBMS world, then what's the core reason behind it requiring expensive hardware to perform better? – redeemed Sep 13 '15 at 11:25
  • RAID + data sharding is the RDBMS way of scaling, but it has not been very successful. It is very expensive with limited success. – Ravindra babu Sep 13 '15 at 11:31
  • Thanks. But my question is: what is the underlying reason behind the non-scalability of RDBMS? – redeemed Sep 15 '15 at 04:34
  • RDBMS can handle gigabytes of data, and Hadoop provides a framework to support tera/petabytes of data. MapReduce is the key to achieving this, due to processing on the data node with data locality. – Ravindra babu Sep 15 '15 at 05:38
  • RDBMS do not support a Hadoop-like framework where processing can be done on 2000+ data nodes and results can still be published through a centralised controller. – Ravindra babu Sep 15 '15 at 05:43
  • https://dzone.com/articles/oracle-vs-teradata-vs-hadoop-1 will clarify which one to use depending on the use case. – Ravindra babu Sep 15 '15 at 06:03
  • Best explanation :) – minigeek Apr 25 '17 at 08:56
1

First, Hadoop IS NOT a DB replacement.

An RDBMS scales vertically and Hadoop scales horizontally.

This means that to scale an RDBMS to twice its capacity, you need hardware with double the memory, double the storage, and double the CPU. That is very expensive and has limits; there isn't a server with 10 TB of RAM, for example. With Hadoop it is different: you don't need expensive cutting-edge technology. Instead, you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.

Another advantage is that, instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.

RojoSam
0

The one issue I have with the description above is that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now, I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both HAWQ and Impala. They both require about 30% more hardware to do the same job on structured data. HBase is less efficient.

There is no magic silver bullet. It has been my experience that both structured and unstructured approaches have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.

You always have to structure the data to make use of it, and that structuring takes time somewhere. You either structure it before you put it into an RDBMS, or at query time.

-1

In an RDBMS, data is structured, or rather, indexed. Retrieving any particular 'nth' column means loading the entire database and then selecting the 'nth' column.

Whereas in Hadoop, say with Hive, we load only the particular column from the entire data set. Moreover, the data loading is done by MapReduce programs, which run in a distributed fashion and reduce the overall time.

Hence, two advantages of using Hadoop and its tools.
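A minimal sketch of that column-wise read through Hive's JDBC interface (the host, database, table, and column names are made up for illustration; the point only holds when the table is stored in a columnar file format such as ORC or Parquet, so that only the selected column's bytes are read from HDFS):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveColumnRead {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the URL, credentials, table and column are hypothetical.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hiveuser", "");
         Statement stmt = conn.createStatement();
         // With a columnar format (ORC/Parquet), only the 'amount' column is read;
         // the query itself runs as distributed tasks on the cluster.
         ResultSet rs = stmt.executeQuery("SELECT amount FROM sales")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```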

Storm