
I'm trying to understand how MapReduce actually works. Please read what I've written below and tell me if there are any missing or incorrect parts. Thank you.

The data is first split into what are called input splits (a logical kind of grouping whose size we define according to our record-processing needs). Then there is a Mapper for every input split, which takes its input split and sorts it by key and value. Then there is the shuffling process, which takes all of the data from the mappers (key-value pairs) and merges all the same keys with their values (the output is every key with its list of values). The shuffling process occurs in order to give the reducer, for each type of key, one key with its combined values. Then the Reducer merges all the key-value pairs into one place (a page, maybe?), which is the final result of the MapReduce process. We only have to define the code for the Map step (which always outputs key-value pairs) and the Reduce step (the final result: it gets the key-value input and can count, sum, average, etc.).
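
To make my question concrete, here is how I picture the two steps we have to write, as a minimal sketch in plain Python (illustrative only, not the Hadoop API; `map_fn` and `reduce_fn` are made-up names):

```python
# Minimal wordcount sketch in plain Python (illustrative only, not the
# Hadoop API; map_fn and reduce_fn are made-up names).

def map_fn(record):
    # Map step: one input record (a line of text) -> (key, value) pairs.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Reduce step: one key plus the list of all its values -> final result.
    # Here it is a count; it could just as well be a sum, average, etc.
    return (key, sum(values))

print(map_fn("deer bear river"))  # [('deer', 1), ('bear', 1), ('river', 1)]
print(reduce_fn("deer", [1, 1]))  # ('deer', 2)
```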

  • Would you be willing to edit your question to demonstrate what you've attempted? – Ryan Morton Mar 06 '18 at 15:14
  • @Ryan Morton, I'm actually just trying to understand the concept. I've seen an example on counting words and read some articles, so I want to make sure I got everything correct. (I'm new to big data, Hadoop, etc.) –  Mar 06 '18 at 15:42

2 Answers


Your understanding is slightly wrong, especially regarding how the mapper works. Here is a very nice pictorial image to explain it in simple terms:

[diagram: three bundles of chocolates being counted in parallel, illustrating the map, shuffle, and reduce phases]

It is similar to the wordcount program, where:

  • Each bundle of chocolates is an InputSplit, which is handled by one mapper. So we have 3 bundles.
  • Each chocolate is a word. One or more words (making a sentence) form a record, which is the input to a single map() call. So, within one input split there may be multiple records, and each record is input to one map() call.
  • The mapper counts the occurrence of each word (chocolate) and emits the count. Note that each map() call works on only one line (record). As soon as it is done, the mapper picks the next record from the input split. (2nd phase in the image)

  • Once the map phase is finished, sorting and shuffling take place to make a bucket of counts for each distinct chocolate. (3rd phase in the image)

  • One reducer gets one bucket, with the name of the chocolate (the word) as the key and a list of counts as the value. So, conceptually, there is one reduce call per distinct word in the whole input file.
  • The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word (see the sketch after this list).
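
To tie the bullets together, here is a toy end-to-end simulation in plain Python (not the Hadoop API; the in-memory shuffle only mimics what Hadoop does across machines, and the three splits are made-up sample input):

```python
from collections import defaultdict

def run_wordcount(splits):
    """Simulate map -> shuffle/sort -> reduce for wordcount.
    `splits` is a list of input splits, each a list of records (lines)."""
    # Map phase: one "mapper" per split, one map() call per record,
    # each emitting (word, 1) pairs.
    mapped = []
    for split in splits:
        for record in split:
            mapped.extend((word, 1) for word in record.split())

    # Shuffle/sort phase: group all values for the same key into one bucket.
    buckets = defaultdict(list)
    for key, value in mapped:
        buckets[key].append(value)

    # Reduce phase: one reduce call per distinct key (word).
    return {key: sum(values) for key, values in sorted(buckets.items())}

splits = [["deer bear river"], ["car car river"], ["deer car bear"]]
print(run_wordcount(splits))  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```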

The diagram below shows how a single input split of the wordcount program is processed:

[diagram: map, sort/shuffle, and reduce steps for one input split of the wordcount program]

Gyanendra Dwivedi
  • So is an input split just part of the data to be map-reduced? Meaning every time I run MapReduce on some data I'll probably have more than one input split? How does it help me to have more than one input split, considering mappers only care about the number of records? –  Mar 06 '18 at 20:14
  • Yes, in a practical scenario there would be more than one `inputsplit`. By default, HDFS breaks files into blocks of 128 MB (so 1 GB = 8 blocks). One block is the input to one mapper in the form of an `inputsplit` (but that does not mean that a block and an `inputsplit` are the same). You need to know about `blocksize`, `inputsplit` and how they are processed in `mapreduce`. – Gyanendra Dwivedi Mar 06 '18 at 20:44
  • I think I get it now. We use this logical unit called an input split to determine record boundaries, pretty much because a block may end in the middle of a record (see the sketch after this comment thread). Thanks mate! –  Mar 07 '18 at 06:19
  • Yeah, you got it right. Kindly upvote the answer, if it helped. – Gyanendra Dwivedi Mar 07 '18 at 06:24
  • Already did. Since I'm new to the website it said that the upvote doesn't show, but it still counts as an upvote. –  Mar 07 '18 at 06:30
  • Thanks, Happy to help you. – Gyanendra Dwivedi Mar 07 '18 at 06:31
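
To illustrate the block-versus-split distinction discussed in this thread, here is a rough sketch in plain Python (a deliberate simplification: Hadoop's `FileInputFormat` also honors min/max split-size settings, and a real split extends past the block boundary to finish a record that straddles two blocks):

```python
def input_splits(file_size, block_size):
    # Simplified model: one split per block, with the last split possibly
    # shorter. In real Hadoop a split may run past its block boundary so
    # that a record straddling two blocks is read by exactly one mapper.
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 1 GB file with 128 MB blocks -> 8 (offset, length) splits.
MB = 1024 * 1024
print(input_splits(1024 * MB, 128 * MB))
```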

Similar Q&A: Simple explanation of MapReduce?

Also, this post explains Hadoop HDFS & MapReduce in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures

Pradeep Bhadani