Hi, I want to learn how to sort the word count output by value in Hadoop. I know Hadoop takes care of sorting by key, but not by value.

I know that to sort the values we need a partitioner, a grouping comparator, and a sort comparator,

but I am a bit confused about how to apply these concepts together to sort the word count by value.

Do we need another MapReduce job to achieve this, or can a combiner count the occurrences, sort them, and emit the result to the reducer?

Can anyone explain how to sort the word count example by values?

user1585111

2 Answers

You need a second MapReduce job. Until the total counts are final (which is what the first MR job computes), there is nothing to sort by value (the counts of the words). Doing it in a single pass is logically not possible.
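
To make the two-step idea concrete, here is a minimal plain-Java sketch of the logic. It is not Hadoop code: the class and method names (`WordCountByValueSketch`, `countWords`, `sortByCount`) are made up for illustration. The first method stands in for the first job (finalize the counts), and the second stands in for the second job (swap each pair so the count becomes the key, then sort on that key). Descending order is an assumption here; the question does not say which direction is wanted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from the Hadoop API.
class WordCountByValueSketch {

    // Stage 1 (what the first job does): total occurrences per word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Stage 2 (what the second job does): make the count the key and sort on it.
    static List<Map.Entry<Integer, String>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Swap (word, count) into (count, word), as the second mapper would.
            swapped.add(Map.entry(e.getValue(), e.getKey()));
        }
        // Highest count first (assumed); the sort is stable, so ties keep word order.
        swapped.sort(Map.Entry.<Integer, String>comparingByKey(Comparator.reverseOrder()));
        return swapped;
    }

    public static void main(String[] args) {
        List<String> words = List.of("b", "a", "b", "c", "b", "a");
        for (Map.Entry<Integer, String> e : sortByCount(countWords(words))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real Hadoop job the swap would live in the second job's mapper, and the framework's shuffle would do the sorting on the new key. If descending order is needed there, a custom comparator registered with `Job.setSortComparatorClass` is one way to get it.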

Rags
  • I mean just sorting based on the number of occurrences – user1585111 Aug 23 '13 at 15:31
  • Yes, I have the same understanding. To determine the number of occurrences, you need to run an MR job. The number of occurrences of a key can be determined only once that key has been fully processed, and when the next key arrives, the earlier key is out of context for the Reduce task. So it is not possible to keep the word as the key and sort by value. You need to pipe the output to another MR job and use the value as the key in the second job. – Rags Aug 23 '13 at 15:36
  • I'm just a beginner; your answer is helpful. Thank you. – user1585111 Aug 23 '13 at 15:45
  • You are welcome. Wish you the best. – Rags Aug 23 '13 at 16:43
  • You can also pipe the output through a standard *nix executable like `sort`, which would work just fine. You can sort numerically on the second field. Something like `cat part-* | sort -nk 2`, `-n` being numeric, `-k 2` being field 2. – tommy_o Aug 23 '13 at 23:18

This is called secondary sort. See this and this for details.

Tariq
  • Secondary sort doesn't help in sorting by number of occurrences, as asked in the question. Impossible to achieve! – rbyndoor Oct 03 '16 at 10:56
  • @ruby: The question is about sorting the result of a wordcount job based on its values, which are the counts of each word. What makes you think this is impossible to achieve? – Tariq Oct 04 '16 at 11:36
  • No. Based on the user's comments, it's very clear that user1585111 wants to sort by number of occurrences. That's not what secondary sort can do. – rbyndoor Oct 05 '16 at 13:35