I am just a beginner in the Hadoop framework. I would like to understand a few concepts here; I have browsed so many links, but I would like to get clear answers. 1) Why does MapReduce work only with key-value pairs? I also read that I can create a MapReduce job without actually using reduce. 2) The key for the input of the map phase is the file offset key. Can I use an explicit key value or a custom input?

Karthi
  • 1) You need to understand the concept of the shuffle and sort phase to know why key-value makes sense. 2) You can use whatever key you want. For reading pretty much any splittable file, the offset is perfect as it allows evenly divisible blocks to be mapped across – OneCricketeer Mar 02 '16 at 05:10
  • For your in-between question (running without a reducer), please read http://stackoverflow.com/questions/10630447/hadoop-difference-between-0-reducer-and-identity-reducer – OneCricketeer Mar 02 '16 at 05:14

1 Answer


Good, you are digging into Hadoop concepts.

1) Can I use an explicit key value or a custom input?: Yes, write your own RecordReader (overriding the default one) to do so.
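As a rough illustration, here is a minimal sketch (my own, not a definitive implementation) using the new `org.apache.hadoop.mapreduce` API: a custom InputFormat/RecordReader that wraps the built-in LineRecordReader and emits the first tab-separated field of each line as the key instead of the byte offset. The class names FirstFieldInputFormat and FirstFieldRecordReader are just placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FirstFieldInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                       TaskAttemptContext context) {
        return new FirstFieldRecordReader();
    }

    public static class FirstFieldRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // Split each line on the first tab: the left part becomes the key,
            // the rest becomes the value (empty if there is no tab).
            String line = lineReader.getCurrentValue().toString();
            int tab = line.indexOf('\t');
            if (tab >= 0) {
                key.set(line.substring(0, tab));
                value.set(line.substring(tab + 1));
            } else {
                key.set(line);
                value.set("");
            }
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}
```

In the driver you would plug it in with `job.setInputFormatClass(FirstFieldInputFormat.class)`.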

2) Why does MapReduce work only with key-value pairs?: MapReduce, as the name suggests, maps (filters) the required data and then reduces it (combines records that share a key) over the data set fed to the program. Now, why key-value pairs? Since you are processing unstructured data, you would not want the same unstructured data as output; you need some way to group and manipulate it. Think of using a Map in Java: it lets you uniquely identify each pair, and Hadoop does the same with the help of Sort & Shuffle.
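To make that concrete, here is a minimal word-count sketch (a standard textbook example, not anything specific to your data): the mapper emits (word, 1) pairs, the framework groups them by key during shuffle & sort, and the reducer combines all values that share a key.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The input key is the byte offset of the line; we ignore it and
            // emit one (word, 1) pair per token.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // All counts for the same word arrive together because the
            // framework grouped them by key during shuffle & sort.
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```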

Create a MapReduce job without actually using reduce?: Of course. It completely depends on your use case, but it is recommended only for small operations and for scenarios where the mapper outputs do not need to be combined to produce the expected output.
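A map-only job is just a matter of telling the framework to use zero reducers. A minimal driver sketch (paths and class names are placeholders; it reuses the hypothetical TokenMapper from the word-count sketch above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);

        job.setMapperClass(WordCount.TokenMapper.class);
        job.setNumReduceTasks(0);              // no reduce phase at all

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With zero reducers, the mapper output is written straight to HDFS; shuffle, sort, and reduce are skipped entirely.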

Reason: this is where the distributed concept and commodity hardware are given priority. For example, say I have a large data set to process. While processing that data set with a plain Java program (just Java, not Hadoop), we store the required data in Collection objects (which is as simple as using RAM space). Hadoop is introduced to do the same job in a different fashion: store the required data in the context. The context in the mapper refers to intermediate data (local FS); in the reducer it refers to the output (HDFS). Of course, in both cases the context stores data on the hard disk.

Hadoop thus helps to do all these calculations against the hard disk instead of in RAM.

I suggest reading Hadoop: The Definitive Guide and the Data Algorithms book for better understanding.

srikanth