Java Mapreduce group by compositekey and sort

Question

I have a mapreduce job that does some processing and produces a composite key (implements WritableComparable) of city:fruit with an associated count. Now I want to chain it with a secondary mapreduce job that determines the city with the highest count for each fruit type.

Sample composite key output from mapreduce job 1:

+---------------------+-------+
| city:fruit composite| count |
+---------------------+-------+
| london:apples       | 3     |
+---------------------+-------+
| london:bannanas     | 2     |
+---------------------+-------+
| london:oranges      | 15    |
+---------------------+-------+
| charleston:apples   | 20    |
+---------------------+-------+
| charleston:bannanas | 1     |
+---------------------+-------+
| charleston:oranges  | 3     |
+---------------------+-------+
| chicago:bannanas    | 17    |
+---------------------+-------+
| chicago:apples      | 5     |
+---------------------+-------+
| chicago:oranges     | 11    |
+---------------------+-------+

Desired output from job 2:

+------------+----------+
| city       | fruit    |
+------------+----------+
| london     | oranges  |
+------------+----------+
| charleston | apples   |
+------------+----------+
| chicago    | bannanas |
+------------+----------+

How can I accomplish this? In my SQL mind, the composite key would be two columns, one for city, one for fruit. I would group by the fruit, sort, and grab the row with the highest count. I can't figure out how that translates to the mapreduce world. Any advice would be appreciated!

Matthias Kricke · Accepted Answer · 2016-06-17T08:43:29.463

2

Process

Read your data into a new map reduce job
Split your information into city as key and a compound value of fruit:count
In the reduce phase you have every value for a city at hand. Now you can iterate over all of those values in a loop. Split them up and remember the biggest fruit count and fruit.
Now write your data to a database or the HDFS

Be aware that for each reducer a separate file is written. You can easily merge them afterwards with HDFS functionality. There is also the possibility to only have one reducer, however I didn't like this way because it is not scalable.

edited Jun 17 '16 at 08:43

answered Jun 16 '16 at 10:16

Matthias Kricke

4,931
4
29
43

This was exactly correct, thanks! Is there a way to sort the fuit:count composite before its sent to the reducer? – ph34r Jun 16 '16 at 23:54
1

Yes, but this would be a bit more to write. Please ask a new question and I try to answer it if you provide the link here. – Matthias Kricke Jun 17 '16 at 08:33
1

But to give you a hint, SecondaryOrdering is the keyword you want to search. This will not sort in the map phase but before the reduce phase – Matthias Kricke Jun 17 '16 at 08:49
I did some research on secondary ordering/sorting but can't seem to find materials that focus on composite values. I've posted a new question: http://stackoverflow.com/questions/37891385/java-mapreduce-sort-composite-value – ph34r Jun 17 '16 at 22:46

Java Mapreduce group by compositekey and sort

1 Answers1