
Suppose I want to calculate word co-occurrence using Hadoop (measuring the frequency of two words appearing one after the other). This is a well-known problem with a well-known solution: for each document the mapper reads, it outputs pairs ((w,u),1), where w and u are words that appear one after the other. The reducer then sums the occurrences of each (w,u) pair. A minimal sketch of that mapper/reducer pair is below.
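For concreteness, here is a minimal sketch of that mapper/reducer pair (the class and method names are mine; the (w,u) pair is encoded as a single comma-joined Text key, and the mapper pairs adjacent words within one input record):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Cooccurrence {

    // Mapper: for each input record (a line of text), emit ((w,u), 1)
    // for every pair of adjacent words w, u.
    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] words = line.toString().trim().split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                pair.set(words[i] + "," + words[i + 1]);
                context.write(pair, ONE);
            }
        }
    }

    // Reducer: sum the 1s emitted for each (w,u) pair.
    public static class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(pair, total);
        }
    }
}
```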

My question is as follows: HDFS partitions large files into blocks (128 MB or 256 MB), and each mapper operates on a different block, so the above algorithm will miss pairs of words that straddle the boundary between two blocks. For example, if the original document contains the words "hello world", and after the split into blocks "hello" ends up as the last word of block #1 and "world" as the first word of block #2, then the above algorithm will not count this co-occurrence.

How can we handle this edge case with Hadoop?

Thanks, Aliza

  • possible duplicate of [Hadoop - how are map-reduce tasks know which part of a file to handle?](http://stackoverflow.com/questions/8894902/hadoop-how-are-map-reduce-tasks-know-which-part-of-a-file-to-handle) – Thomas Jungblut Jul 13 '14 at 14:12
  • That post does not answer my question. The only thing that was semi-related was this: one way is to let a single mapper process the complete file by using the FileInputFormat#isSplitable method. This is not an efficient approach if the file size is too large. – Aliza Jul 14 '14 at 05:28

1 Answer


This is normally handled transparently by Hadoop: the input format's record reader, not the raw block boundary, decides where one mapper's input ends and the next begins (see How does Hadoop process records split across block boundaries? for example).
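To illustrate, here is a minimal driver sketch (the class name is mine, and it reuses the hypothetical PairMapper/PairReducer sketched in the question). With the default TextInputFormat, the LineRecordReader skips the partial line at the start of any non-initial split (that line belongs to the previous split's reader) and reads past its split's end to finish the last line, fetching the extra bytes from the next block, so a line straddling two blocks is processed by exactly one mapper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CooccurrenceDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word co-occurrence");
        job.setJarByClass(CooccurrenceDriver.class);

        // TextInputFormat's LineRecordReader hands each mapper complete lines,
        // even when a line straddles two HDFS blocks, so the mapper never sees
        // a word pair cut in half at a block boundary.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(Cooccurrence.PairMapper.class);
        // Summing is associative and commutative, so the reducer doubles as a combiner.
        job.setCombinerClass(Cooccurrence.PairReducer.class);
        job.setReducerClass(Cooccurrence.PairReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```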

NJ73