I am using Hadoop MapReduce to sort a large document, with KeyFieldBasedPartitioner routing different inputs to different reducers. My idea is to have the mapper emit the first letter of each word as the key and the word itself as the value. All words with the same first letter go to one reducer, which sorts them, and at the end I use -getmerge to merge all the results into one document and view a fully sorted document.
So the entire process for me so far looks like this:
Giant document -> mapper (removes punctuation and splits words) -> outputs (first letter, word) pairs into KeyFieldBasedPartitioner -> sends each pair to one of 26 reducers (one for each letter) -> reducer sorts
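For reference, the mapper step in that pipeline could be sketched roughly like this. It assumes a Hadoop Streaming job with a Python mapper, which is an assumption rather than something stated above. The name map_line is a hypothetical helper, and lowercasing words and dropping tokens that do not start with a-z are also assumptions, made so that there are exactly 26 possible keys.

```python
#!/usr/bin/env python3
"""Sketch of a streaming mapper: emits "<first letter>\t<word>" per word."""
import string
import sys

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def map_line(line):
    """Return a list of (first_letter, word) pairs for one input line.

    Hypothetical helper: words are lowercased (assumption) and tokens that
    do not start with a-z after punctuation removal are skipped.
    """
    pairs = []
    for token in line.split():
        word = token.translate(_STRIP_PUNCT).lower()
        if word and "a" <= word[0] <= "z":
            pairs.append((word[0], word))
    return pairs


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Tab-separated key/value lines, one pair per line.
    for line in stdin:
        for key, word in map_line(line):
            stdout.write(f"{key}\t{word}\n")


if __name__ == "__main__":
    main()
```

With this sketch, the key is the single letter before the tab, so the partitioner only sees that letter when it decides which reducer receives the word.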
Right now each reducer sorts its own part correctly, but when I use -getmerge to combine them, the merged document starts with words beginning with 'n' and ends with words beginning with 'm'. How can I make sure the final output comes out in alphabetical order?