Suppose I want to calculate word co-occurrence using Hadoop (measuring the frequency with which two words appear one after the other). This is a well-known problem with a well-known solution: for each document the mapper reads, it outputs pairs ((w,u),1), where w and u are words that appear one after the other. The reducer then sums the occurrences for each (w,u) pair.
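To make the algorithm concrete, here is a minimal sketch of the map and reduce logic in plain Python (not the Hadoop API; function names are illustrative):

```python
from collections import defaultdict

def mapper(document):
    """Emit ((w, u), 1) for every pair of adjacent words in the document."""
    words = document.split()
    for w, u in zip(words, words[1:]):
        yield (w, u), 1

def reducer(pairs):
    """Sum the counts for each (w, u) key, as the shuffle/reduce phase would."""
    counts = defaultdict(int)
    for pair, n in pairs:
        counts[pair] += n
    return dict(counts)

doc = "the quick brown fox the quick"
print(reducer(mapper(doc)))
# ('the', 'quick') is counted twice; every other adjacent pair once
```

In real Hadoop this would be a `Mapper` emitting a composite (w,u) key and an `IntWritable` count, with the framework doing the grouping between map and reduce.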
My question is as follows: HDFS partitions large files into blocks (128 MB or 256 MB), and each mapper operates on a different block. So the above algorithm will fail to count pairs of words that straddle the boundary between two blocks. For example, if the original document contained the words "hello world", and after the split into blocks "hello" ended up as the last word of block #1 and "world" as the first word of block #2, then the above algorithm will not count this co-occurrence.
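The boundary effect I am worried about can be simulated in plain Python (again not the Hadoop API; this just models each mapper seeing only its own block's words):

```python
from collections import defaultdict

def mapper(block):
    """Emit adjacent-word pairs from a single block, as one mapper would."""
    words = block.split()
    for w, u in zip(words, words[1:]):
        yield (w, u), 1

def count(blocks):
    """Run the mapper on each block independently and sum all emitted pairs."""
    counts = defaultdict(int)
    for block in blocks:
        for pair, n in mapper(block):
            counts[pair] += n
    return dict(counts)

text = "say hello world today"
whole = count([text])                        # one block: all pairs counted
split = count(["say hello", "world today"])  # split falls between the blocks

print(("hello", "world") in whole)  # True
print(("hello", "world") in split)  # False: the boundary pair is lost
```

This is the missed count: neither mapper sees both "hello" and "world", so the pair is never emitted.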
How can we handle this edge-case with hadoop?
Thanks, Aliza