With reference to the basic WordCount example: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html I know that HDFS divides files into blocks, and that map tasks work on a single block. So there is no guarantee that the block analyzed by a map task does not contain a word continuing in the next block, causing a mistake (one word counted twice). I know this is an example and is always shown with small files, but wouldn't this be a problem in real-world scenarios?
- Do you mean that the block analyzed by a map task will not contain splits? – Rajen Raiyarela Feb 16 '16 at 05:50
- @RajenRaiyarela I mean they can both contain the same word: the beginning of the word in the first block, and the ending in the second block. – Felice Pollano Feb 16 '16 at 06:07
- Possible duplicate of [How does Hadoop process records split across block boundaries?](http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries) – Rajen Raiyarela Feb 16 '16 at 06:38
1 Answer
In Hadoop you work on input splits, not on blocks. An input split is a logical chunk containing complete records: if a record (e.g. a line of text) starts in one HDFS block and continues in the next, the split is extended so that one mapper reads the whole record, and the next mapper skips the partial record at the start of its split. You still want to avoid a mapper spanning two blocks more than necessary, because that costs performance and creates network traffic.
In a text file, say block1 ends with "I am a Ha" and block2 continues with "doop developer". The mapper processing the first split reads past the block boundary to finish the line, so "Hadoop" is counted exactly once; fetching those trailing bytes from the node holding the next block is what creates the extra network traffic, since a mapper always works on a full input split.
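The boundary logic described above can be sketched with a small simulation. This is not Hadoop source code; it mimics the two rules that `TextInputFormat`'s line record reader follows: a reader that does not start at offset 0 skips the partial first line (the previous split's reader owns it), and every reader keeps reading past its split's end to finish its last line. Together these rules ensure each line is processed exactly once, even when it straddles a boundary. The function and data here are illustrative, not part of any real API:

```python
def read_split(data: bytes, start: int, length: int):
    """Yield the complete lines that belong to the split [start, start+length)."""
    pos = start
    if start != 0:
        # Rule 1: skip the (possibly partial) line begun in the previous
        # split; that split's reader consumes it in full.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return
        pos = nl + 1
    end = start + length
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]
            return
        # Rule 2: the line may extend past `end` -- read it anyway.
        yield data[pos:nl]
        pos = nl + 1

data = b"I am a Hadoop developer\nwriting a WordCount job\n"
split_size = 10  # artificial boundary that falls inside the word "Hadoop"
lines = []
for offset in range(0, len(data), split_size):
    lines.extend(read_split(data, offset, split_size))
print([line.decode() for line in lines])
```

Running this yields each line exactly once, with "Hadoop" intact, even though the 10-byte boundary cuts through the middle of the word.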

answered by Stefan Papp, edited by PradeepKumbhar