I am reading Hadoop MapReduce tutorials and have come up with the following shallow understanding. Could anyone help confirm whether my understanding is correct?
MapReduce is a way to aggregate data
- in a distributed environment
- over unstructured data in very large files
- using Java, Python, etc.
to produce results similar to what SQL aggregate functions produce in an RDBMS:

select k2, count(*), sum(v2), max(v2), min(v2), avg(v2)
from input_file
group by k2
- The map() method basically pivots horizontal data (each value v1 being one whole line of the input file) into vertical rows, each row being a (k2, v2) pair with a string key and a numeric value.
- The grouping happens in the shuffle and sort (partitioning) phase of the data flow.
- The reduce() method is responsible for computing/aggregating the grouped data, as in the sketch below.
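
To make my understanding concrete, here is a minimal Java sketch of what I have in mind; the class names, the tab-delimited "k2<TAB>v2" line format, and the paths are my own assumptions, not from any tutorial:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregateByKey {

    // map(): pivots one horizontal line v1 into a vertical (k2, v2) pair
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text v1, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = v1.toString().split("\t"); // assumed line format: k2<TAB>v2
            if (fields.length < 2) return;               // skip malformed lines
            ctx.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // reduce(): receives all v2 values for one k2 (grouped by the shuffle)
    // and computes count/sum/max/min/avg, like the SQL above
    public static class StatsReducer
            extends Reducer<Text, DoubleWritable, Text, Text> {
        @Override
        protected void reduce(Text k2, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long count = 0;
            double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                double d = v.get();
                count++;
                sum += d;
                min = Math.min(min, d);
                max = Math.max(max, d);
            }
            ctx.write(k2, new Text(count + "\t" + sum + "\t" + max + "\t" + min + "\t" + sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregate by k2");
        job.setJarByClass(AggregateByKey.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(StatsReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}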
MapReduce jobs can be chained (the output of one job feeding the input of the next), just as SQL statements can be nested, to produce complex aggregation output; see the sketch below.
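
Here is roughly how I picture the chaining; the stage names and the intermediate staging path are hypothetical, not from any tutorial:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1] + "-tmp"); // hypothetical staging directory

        Job stage1 = Job.getInstance(conf, "stage 1: pre-aggregate");
        stage1.setJarByClass(ChainedJobs.class);
        // ... set stage 1's mapper/reducer/output types here ...
        FileInputFormat.addInputPath(stage1, new Path(args[0]));
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true)) System.exit(1);

        Job stage2 = Job.getInstance(conf, "stage 2: final aggregate");
        stage2.setJarByClass(ChainedJobs.class);
        // ... set stage 2's mapper/reducer/output types here ...
        FileInputFormat.addInputPath(stage2, intermediate); // stage 2 reads stage 1's output
        FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}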
Is that correct?
With Hive on top of Hadoop, the MR code will be generated by the HiveQL process engine (an example of what I mean follows at the end). Therefore, from a coding perspective, MR coding in Java will gradually be replaced by high-level HiveQL. Is that true?
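
For instance, I imagine the whole aggregation above collapsing into something like this in HiveQL; the table definition, column types, and location are my assumptions:

create external table input_file (k2 string, v2 double)
row format delimited fields terminated by '\t'
location '/data/input_file';  -- hypothetical HDFS path

select k2, count(*), sum(v2), max(v2), min(v2), avg(v2)
from input_file
group by k2;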