
With reference to the basic WordCount example: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html I know that HDFS divides files into blocks, and that map tasks work on a single block. So there is no guarantee that the block analyzed by a map task won't contain a word that continues in the next block, causing a mistake (one word counted twice). I know this is an example and is always shown with small files, but wouldn't this be a problem in real-world scenarios?

Felice Pollano
  • Do you mean that the block analyzed by a map task will not contain splits? – Rajen Raiyarela Feb 16 '16 at 05:50
  • @RajenRaiyarela I mean both blocks can contain the same word: the beginning of the word in the first block and the ending in the second block. – Felice Pollano Feb 16 '16 at 06:07
  • Possible duplicate of [How does Hadoop process records split across block boundaries?](http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries) – Rajen Raiyarela Feb 16 '16 at 06:38

1 Answer


In Hadoop, mappers work on input splits, not directly on HDFS blocks. An input split is a logically complete set of records: the record reader that feeds a mapper keeps reading past the block boundary until it finishes the last record of its split, so no word is ever cut in half or counted twice. What you want to avoid is a mapper routinely reaching into a block stored on another node, because that remote read costs performance and creates network traffic.
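To make that concrete, here is a minimal, self-contained sketch (plain Java, not the actual Hadoop `LineRecordReader` source) of the two rules a line-oriented record reader follows at split boundaries. Together they guarantee that every line is read exactly once, by exactly one mapper:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified sketch of how a line-oriented record reader handles
 * split boundaries:
 *   rule 1: every reader except the first skips the partial first
 *           line of its split (the previous reader owns it);
 *   rule 2: every reader keeps reading past the end of its split
 *           until it finishes its last line.
 */
public class SplitBoundaryDemo {

    /** Returns the records belonging to the split [start, end) of the file bytes. */
    static List<String> readSplit(byte[] file, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Rule 1: a non-first split discards its partial first line.
        if (start != 0) {
            while (pos < file.length && file[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Rule 2: read whole lines; a line that begins before `end` may
        // finish after it (that is the remote-read traffic mentioned above).
        while (pos < file.length && pos < end) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            records.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        // The "block" boundary falls in the middle of the word "Hadoop".
        byte[] file = "I am a Hadoop developer\n".getBytes(StandardCharsets.UTF_8);
        int boundary = 9; // splits the bytes into "I am a Ha" + "doop developer\n"
        System.out.println("split 1: " + readSplit(file, 0, boundary));
        System.out.println("split 2: " + readSplit(file, boundary, file.length));
        // split 1 reads the whole line; split 2 reads nothing: no double count.
    }
}
```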

In text terms, suppose block1 ends with "I am a Ha" and block2 continues with "doop developer". The record reader for the first split reads on into block2 to complete the line, which creates some network traffic: a mapper always works on a full input split, and the bytes that spilled into the next block have to be transferred over from the node holding it.
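For completeness, here is the mapper from the linked WordCount tutorial (lightly reformatted as a top-level class). Note that it receives one complete line at a time as `value`, so the boundary handling described above never surfaces in user code:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // `value` is always one complete line, as delivered by the record
    // reader; the mapper never sees block or split boundaries.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1) for each whole word
    }
  }
}
```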

PradeepKumbhar
Stefan Papp