1

I have a mapreduce job that does some processing and produces a composite key (implements WritableComparable) of city:fruit with an associated count. Now I want to chain it with a secondary mapreduce job that determines the city with the highest count for each fruit type.

Sample composite key output from mapreduce job 1:

+---------------------+-------+
| city:fruit composite| count |
+---------------------+-------+
| london:apples       | 3     |
+---------------------+-------+
| london:bannanas     | 2     |
+---------------------+-------+
| london:oranges      | 15    |
+---------------------+-------+
| charleston:apples   | 20    |
+---------------------+-------+
| charleston:bannanas | 1     |
+---------------------+-------+
| charleston:oranges  | 3     |
+---------------------+-------+
| chicago:bannanas    | 17    |
+---------------------+-------+
| chicago:apples      | 5     |
+---------------------+-------+
| chicago:oranges     | 11    |
+---------------------+-------+

Desired output from job 2:

+------------+----------+
| city       | fruit    |
+------------+----------+
| london     | oranges  |
+------------+----------+
| charleston | apples   |
+------------+----------+
| chicago    | bannanas |
+------------+----------+

How can I accomplish this? In my SQL mind, the composite key would be two columns, one for city, one for fruit. I would group by the fruit, sort, and grab the row with the highest count. I can't figure out how that translates to the mapreduce world. Any advice would be appreciated!

ph34r
  • 233
  • 1
  • 2
  • 9

1 Answers1

2

Process

  1. Read your data into a new map reduce job
  2. Split your information into city as key and a compound value of fruit:count
  3. In the reduce phase you have every value for a city at hand. Now you can iterate over all of those values in a loop. Split them up and remember the biggest fruit count and fruit.
  4. Now write your data to a database or the HDFS

Be aware that for each reducer a separate file is written. You can easily merge them afterwards with HDFS functionality. There is also the possibility to only have one reducer, however I didn't like this way because it is not scalable.

Matthias Kricke
  • 4,931
  • 4
  • 29
  • 43
  • This was exactly correct, thanks! Is there a way to sort the fuit:count composite before its sent to the reducer? – ph34r Jun 16 '16 at 23:54
  • 1
    Yes, but this would be a bit more to write. Please ask a new question and I try to answer it if you provide the link here. – Matthias Kricke Jun 17 '16 at 08:33
  • 1
    But to give you a hint, SecondaryOrdering is the keyword you want to search. This will not sort in the map phase but before the reduce phase – Matthias Kricke Jun 17 '16 at 08:49
  • I did some research on secondary ordering/sorting but can't seem to find materials that focus on composite values. I've posted a new question: http://stackoverflow.com/questions/37891385/java-mapreduce-sort-composite-value – ph34r Jun 17 '16 at 22:46