
I am using Hadoop MapReduce to sort a large document, with the KeyFieldBasedPartitioner routing different inputs to different reducers. My plan is for the mapper to emit the first letter of each word as the key and the word itself as the value. All words with the same first letter go to one reducer, which sorts them. At the end I run -getmerge to combine the reducer outputs into a single document, which I expect to be fully sorted.

So the entire process for me so far looks like this:

Giant document -> mapper (removes punctuation and splits words) -> outputs (first letter, word) pairs to the KeyFieldBasedPartitioner -> each pair is sent to one of 26 reducers (one per letter) -> reducer sorts
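To make the mapper step concrete, here is a minimal sketch of what it might look like. It assumes Hadoop Streaming with a Python script, ASCII punctuation, and lowercasing so that there are at most 26 distinct keys. The name `map_line` is only illustrative and not taken from the actual job.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def map_line(line):
    """Yield 'letter<TAB>word' records for every word in one input line.

    Assumption: words are lowercased, and tokens that do not start with
    a letter (e.g. numbers) are skipped.
    """
    for token in line.translate(_STRIP_PUNCT).lower().split():
        if token[0].isalpha():
            yield f"{token[0]}\t{token}"


# Small demonstration on a single line of text.
for record in map_line("The quick, brown fox."):
    print(record)
```

In a streaming job the same function would be applied to each line read from `sys.stdin`, with every record printed to standard output.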

Right now each reducer sorts its own part correctly, but when I use -getmerge to combine them, the merged document starts with words beginning with 'n' and ends with words beginning with 'm'. How can I make the final output come out in alphabetical order?
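One possible workaround, sketched here rather than a fix inside Hadoop itself, is to merge the part files by content instead of by file order. The sketch assumes the part files have been copied locally (for example with `hadoop fs -get` instead of -getmerge) and that each one is already sorted line by line. `merge_sorted_parts` is a hypothetical helper name.

```python
import heapq


def merge_sorted_parts(parts):
    """Merge several already-sorted sequences of lines into one sorted list.

    Each element of `parts` stands in for the lines of one reducer output
    file; heapq.merge performs a k-way merge without re-sorting everything.
    """
    return list(heapq.merge(*parts))


# Demonstration: the parts arrive in 'n', 'a', 'm' order, as they might
# after -getmerge, but the merged result is alphabetical.
parts = [
    ["n\tnap", "n\tnet"],
    ["a\tant", "a\tapple"],
    ["m\tmap"],
]
for line in merge_sorted_parts(parts):
    print(line)
```

With real files, each element of `parts` could be an open file handle, since file objects iterate over their lines.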

  • Have you checked out this existing question: http://stackoverflow.com/questions/14322381/mapreduce-job-output-sort-order – Ed Baker Oct 10 '16 at 01:01
  • I did see that question, but the solution to that question seemed like it was for Java users. I can't go back to using a single reducer, and I am forced to use the `KeyFieldBasedPartitioner` so implementing my own partitioner isn't an option. –  Oct 10 '16 at 02:11
