
I was reading Hadoop: The Definitive Guide, where it is written that MapReduce is good for updating larger portions of a database, and that it uses sort/merge to rebuild the database, which is dependent on transfer time.

It also says an RDBMS is good for updating only smaller portions of a big database, and that it uses a B-Tree, which is limited by seek time.

Can anyone elaborate on what both these claims really mean?

redeemed
  • Have a look at my post @ http://stackoverflow.com/questions/32538650/hadoop-comparison-to-rdbms/32546933#32546933 and https://dzone.com/articles/oracle-vs-teradata-vs-hadoop-1 & http://stackoverflow.com/questions/13911501/when-to-use-hadoop-hbase-hive-and-pig/33433532#33433532 – Ravindra babu Oct 31 '15 at 07:03
  • Nope, that does not answer my question – redeemed Oct 31 '15 at 09:29
  • My question was what exactly sort/merge does in rebuilding the database in a MapReduce paradigm and how that is related to transfer time, and how a B-Tree is limited by seek time – redeemed Oct 31 '15 at 14:28
  • Major difference: an RDBMS sorts data on a single (or a limited number of) state-of-the-art hardware nodes, while Hadoop can sort the same data by storing and processing it on thousands of nodes. Data locality plays an important role here: data will be processed on the node where it is stored (most of the time), and mapper output will be sent to the reducers over the network – Ravindra babu Oct 31 '15 at 17:47

1 Answer


I am not really sure what the book means, but you would usually run a MapReduce job to rebuild the entire database (or anything else) as long as you still have the raw data.

The really good thing about Hadoop is that it is distributed, so performance is not really a problem, since you can just add more machines.

Let's take an example: you need to rebuild a complex table with 1 billion rows. With an RDBMS, you can only scale vertically, so you depend more on the power of the CPU and on how fast the algorithm is. You will be doing it with some SQL commands: you need to select some data, process it, write it back, and so on. So you will most likely be limited by seek time.
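To make the seek-time point concrete, here is a rough sketch of the access pattern the book is describing: each row is reached through a B-tree index lookup, which costs only a few seeks per row and is great for small updates, but adds up badly if you have to touch a large fraction of the table. The table, columns, and connection URL below are made up for illustration; this is not from the book.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical point update: the WHERE clause on an indexed primary key lets
// the database walk a B-tree down to the single row, so the cost is a few
// disk seeks. Fine for small updates; rebuilding a billion rows this way
// means a billion index traversals.
public class PointUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/demo", "user", "pass");
             PreparedStatement stmt = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
            stmt.setLong(1, 100);  // amount to add
            stmt.setLong(2, 42);   // primary key of the single row to touch
            stmt.executeUpdate();
        }
    }
}
```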

With Hadoop MapReduce, you can just add more machines, so performance is not the problem. Let's say you use 10,000 mappers: the task will be divided into 10,000 mapper containers, and because of Hadoop's nature, all of these containers usually already have their part of the data stored locally on their hard drives. The output of each mapper is always a key-value structured format on its local hard drive, and that output is sorted by key by the mapper.
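For illustration, a minimal mapper for such a rebuild job might look like the sketch below (using the standard org.apache.hadoop.mapreduce API). The input format and the choice of key are assumptions, since no concrete job is named here.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads raw CSV lines from its locally stored HDFS block
// and emits (recordKey, line) pairs. The framework buffers this output on the
// mapper's local disk and sorts it by key before the shuffle.
public class RebuildMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // fields[0] is assumed to hold the record key, e.g. a user id
        context.write(new Text(fields[0]), line);
    }
}
```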

Now the problem is that the data needs to be combined, so all of it is sent to a reducer. This happens over the network and is usually the slowest part if you have big data. The reducer receives all of the data and merge-sorts it for further processing. In the end you have a file that could simply be uploaded to your database.
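The matching reducer side could look roughly like the sketch below. By the time reduce() is called, the framework has already fetched the sorted mapper outputs over the network and merge-sorted them, so the reducer just sees all values grouped by key. The consolidation logic here is made up for the example.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: receives every record for one key, already merged and
// sorted by the shuffle, and writes a single consolidated row that could
// later be bulk-loaded into the database.
public class RebuildReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        StringBuilder merged = new StringBuilder();
        for (Text record : records) {
            if (merged.length() > 0) {
                merged.append('|');  // arbitrary separator for the example
            }
            merged.append(record.toString());
        }
        context.write(key, new Text(merged.toString()));
    }
}
```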

The transfer from the mappers to the reducers is usually what takes the longest if you have a lot of data, and the network is usually your bottleneck. Maybe this is what the book meant by depending on transfer time.

Rowanto
  • Hi Rowanto, thanks for your answer. The transfer time they are talking about is basically disk transfer time. I was wondering how MapReduce takes advantage of the disk transfer rate – redeemed Nov 01 '15 at 10:54
  • This post kind of touches upon the central theme of the question: http://stackoverflow.com/questions/22353122/why-is-a-block-in-hdfs-so-large – redeemed Jan 14 '16 at 14:17