I am reading Hadoop MapReduce tutorials and have come up with the following shallow understanding. Could anyone help confirm whether my understanding is correct?
MapReduce is a way to aggregate data
- in a distributed environment
- over unstructured data in very large files
- using Java, Python, etc.
to produce results similar to what SQL aggregate functions produce in an RDBMS:

select k2, count(*), sum(v2), max(v2), min(v2), avg(v2)
from input_file
group by k2
- The map() method basically pivots horizontal data (each value v1 being one whole line of the input file) into vertical rows, each row being a (k2, v2) pair with a string key and a numeric value.
- The grouping happens in the shuffle and sort (partitioning) phase of the data flow.
- The reduce() method is responsible for computing/aggregating the grouped data, as in the sketch below.
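
To make my understanding concrete, here is a minimal Java sketch of what I have in mind; the class names, the tab-delimited "k2<TAB>v2" line format, and the paths are my own assumptions, not from any tutorial:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregateByKey {

    // map(): pivots one horizontal line v1 into a vertical (k2, v2) pair
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text v1, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = v1.toString().split("\t"); // assumed line format: k2<TAB>v2
            if (fields.length < 2) return;               // skip malformed lines
            ctx.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // reduce(): receives all v2 values for one k2 (grouped by the shuffle)
    // and computes count/sum/max/min/avg, like the SQL above
    public static class StatsReducer
            extends Reducer<Text, DoubleWritable, Text, Text> {
        @Override
        protected void reduce(Text k2, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long count = 0;
            double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                double d = v.get();
                count++;
                sum += d;
                min = Math.min(min, d);
                max = Math.max(max, d);
            }
            ctx.write(k2, new Text(count + "\t" + sum + "\t" + max + "\t" + min + "\t" + sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregate by k2");
        job.setJarByClass(AggregateByKey.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(StatsReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}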
MapReduce jobs can be chained (the output of one job feeding the input of the next), just as SQL statements can be nested, to produce complex aggregation output; see the sketch below.
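
Here is roughly how I picture the chaining; the stage names and the intermediate staging path are hypothetical, not from any tutorial:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1] + "-tmp"); // hypothetical staging directory

        Job stage1 = Job.getInstance(conf, "stage 1: pre-aggregate");
        stage1.setJarByClass(ChainedJobs.class);
        // ... set stage 1's mapper/reducer/output types here ...
        FileInputFormat.addInputPath(stage1, new Path(args[0]));
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) System.exit(1);

        Job stage2 = Job.getInstance(conf, "stage 2: final aggregate");
        stage2.setJarByClass(ChainedJobs.class);
        // ... set stage 2's mapper/reducer/output types here ...
        FileInputFormat.addInputPath(stage2, intermediate); // stage 2 reads stage 1's output
        FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}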
Is that correct?
With Hive on top of Hadoop, the MR code will be generated by the HiveQL process engine (an example of what I mean follows at the end). Therefore, from a coding perspective, MR coding in Java will gradually be replaced by high-level HiveQL. Is that true?
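
For instance, I imagine the whole aggregation above collapsing into something like this in HiveQL; the table definition, column types, and location are my assumptions:

create external table input_file (k2 string, v2 double)
row format delimited fields terminated by '\t'
location '/data/input_file';  -- hypothetical HDFS path

select k2, count(*), sum(v2), max(v2), min(v2), avg(v2)
from input_file
group by k2;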