Suppose I want to calculate word co-occurrence using Hadoop (measuring the frequency with which two words appear one after the other). This is a well-known problem with a well-known solution: for each document the mapper reads, it outputs pairs ((w,u),1), where w and u are words that appear one after the other. The reducer then sums the occurrences for each (w,u) pair.
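To make the algorithm concrete, here is a minimal sketch of the map and reduce logic in plain Python (not the Hadoop API; function names are illustrative):

```python
from collections import defaultdict

def mapper(document):
    """Emit ((w, u), 1) for every pair of adjacent words in the document."""
    words = document.split()
    for w, u in zip(words, words[1:]):
        yield (w, u), 1

def reducer(pairs):
    """Sum the counts for each (w, u) key, as the shuffle/reduce phase would."""
    counts = defaultdict(int)
    for pair, n in pairs:
        counts[pair] += n
    return dict(counts)

doc = "the quick brown fox the quick"
print(reducer(mapper(doc)))
# ('the', 'quick') is counted twice; every other adjacent pair once
```

In real Hadoop this would be a `Mapper` emitting a composite (w,u) key and an `IntWritable` count, with the framework doing the grouping between map and reduce.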
My question is as follows: HDFS partitions large files into blocks (128 MB or 256 MB), and each mapper operates on a different block. So the above algorithm will fail to count pairs of words that straddle the boundary between two blocks. For example, if the original document contained the words "hello world", and after the split into blocks "hello" ended up as the last word of block #1 and "world" as the first word of block #2, then the above algorithm will not count this co-occurrence.
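The boundary effect I am worried about can be simulated in plain Python (again not the Hadoop API; this just models each mapper seeing only its own block's words):

```python
from collections import defaultdict

def mapper(block):
    """Emit adjacent-word pairs from a single block, as one mapper would."""
    words = block.split()
    for w, u in zip(words, words[1:]):
        yield (w, u), 1

def count(blocks):
    """Run the mapper on each block independently and sum all emitted pairs."""
    counts = defaultdict(int)
    for block in blocks:
        for pair, n in mapper(block):
            counts[pair] += n
    return dict(counts)

text = "say hello world today"
whole = count([text])                        # one block: all pairs counted
split = count(["say hello", "world today"])  # split falls between the blocks

print(("hello", "world") in whole)  # True
print(("hello", "world") in split)  # False: the boundary pair is lost
```

This is the missed count: neither mapper sees both "hello" and "world", so the pair is never emitted.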
How can we handle this edge-case with hadoop?
Thanks, Aliza