
With reference to the basic WordCount example: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html I know that HDFS divides files into blocks, and that map tasks work on a single block. So there is no guarantee that the block analyzed by a map task won't contain a word that continues in the next block, causing a mistake (one word counted twice). I know this is an example and is always shown with small files, but wouldn't this be a problem in real-world scenarios?

Felice Pollano
  • Do you mean that the block analyzed by a map task will not contain splits? – Rajen Raiyarela Feb 16 '16 at 05:50
  • @RajenRaiyarela I mean both blocks can contain the same word: the beginning of the word in the first block and the ending in the second block. – Felice Pollano Feb 16 '16 at 06:07
  • Possible duplicate of [How does Hadoop process records split across block boundaries?](http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries) – Rajen Raiyarela Feb 16 '16 at 06:38

1 Answer


In Hadoop, mappers work on input splits, not directly on HDFS blocks. An input split is a logically complete set of records: the record reader that feeds a mapper keeps reading past the block boundary until it finishes the last record of its split, so no word is ever cut in half or counted twice. What you want to avoid is a mapper routinely reaching into a block stored on another node, because that remote read costs performance and creates network traffic.
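To make that concrete, here is a minimal, self-contained sketch (plain Java, not the actual Hadoop `LineRecordReader` source) of the two rules a line-oriented record reader follows at split boundaries. Together they guarantee that every line is read exactly once, by exactly one mapper:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified sketch of how a line-oriented record reader handles
 * split boundaries:
 *   rule 1: every reader except the first skips the partial first
 *           line of its split (the previous reader owns it);
 *   rule 2: every reader keeps reading past the end of its split
 *           until it finishes its last line.
 */
public class SplitBoundaryDemo {

    /** Returns the records belonging to the split [start, end) of the file bytes. */
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Rule 1: a non-first split discards its partial first line.
        if (start != 0) {
            while (pos < file.length && file[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Rule 2: read whole lines; a line that begins before `end` may
        // finish after it (that is the remote-read traffic mentioned above).
        while (pos < file.length && pos < end) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            records.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        // The "block" boundary falls in the middle of the word "Hadoop".
        byte[] file = "I am a Hadoop developer\n".getBytes(StandardCharsets.UTF_8);
        int boundary = 9; // splits the bytes into "I am a Ha" + "doop developer\n"
        System.out.println("split 1: " + readSplit(file, 0, boundary));
        System.out.println("split 2: " + readSplit(file, boundary, file.length));
        // split 1 reads the whole line; split 2 reads nothing: no double count.
    }
}
```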

In text terms, suppose block1 ends with "I am a Ha" and block2 continues with "doop developer". The record reader for the first split reads on into block2 to complete the line, which creates some network traffic: a mapper always works on a full input split, and the bytes that spilled into the next block have to be transferred over from the node holding it.
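For completeness, here is the mapper from the linked WordCount tutorial (lightly reformatted as a top-level class). Note that it receives one complete line at a time as `value`, so the boundary handling described above never surfaces in user code:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // `value` is always one complete line, as delivered by the record
    // reader; the mapper never sees block or split boundaries.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1) for each whole word
    }
  }
}
```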

PradeepKumbhar
Stefan Papp