
Does that mean the output of the first MapReduce job serves as the input of the second MapReduce job, so there are two MapReduce jobs in total? If the input is a sequence of (client, date) pairs and the output is (date, client, max_requests), how can a pipeline of two MapReduce jobs find the client with the most requests for each day?

tang
  • Possible duplicate of [Chaining multiple MapReduce jobs in Hadoop](http://stackoverflow.com/questions/2499585/chaining-multiple-mapreduce-jobs-in-hadoop) – Ravindra babu May 16 '16 at 09:57

1 Answer

MapReduce (MR) is essentially a way to regroup and sort a dataset by a different key. The reduce function then aggregates each group, often directly into the final result.

In your case, the Mapper would take input records of the type you describe:

<client>,<date>,other_data

and map this to:

<date>_<client>,other_data

It's probably easiest to generate a composite key. Rather than writing a custom key class for the key/value types, you can map this to a string key that sorts correctly per day and per client:

YYYYMMDD_<client>
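
As a minimal sketch of the composite-key idea, the Python below builds such a string key and shows that plain string sorting orders records by day first and by client second. The helper name `composite_key` and the sample client ids are hypothetical, and the sketch assumes dates are always written as zero-padded `YYYYMMDD`.

```python
from datetime import date

def composite_key(day: date, client: str) -> str:
    """Build a 'YYYYMMDD_client' key; zero-padded dates sort chronologically as strings."""
    return f"{day:%Y%m%d}_{client}"

# Hypothetical (client, date) pairs, deliberately out of order.
records = [
    ("bob", date(2016, 5, 16)),
    ("alice", date(2016, 5, 15)),
    ("alice", date(2016, 5, 16)),
]

# Sorting the string keys groups by day, then by client within a day.
keys = sorted(composite_key(day, client) for client, day in records)
print(keys)
```

Because the date part has a fixed width, no custom comparator is needed for this ordering.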

This guarantees that all records for a given day and client are processed by the same reducer. You can then simply count the number of records for each key. Then decompose the original key and emit one new record per day and client that looks like:

YYYYMMDD,client,<# requests>
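
The first job can be sketched in plain Python that simulates the map, shuffle, and reduce phases in memory. This is not Hadoop code: the function names (`map_job1`, `reduce_job1`, `run_job1`), the `client,YYYYMMDD,other_data` input layout, and the sample lines are assumptions for illustration, and client ids are assumed to contain no commas.

```python
from collections import defaultdict

def map_job1(line):
    """Mapper: 'client,YYYYMMDD,other_data' -> ('YYYYMMDD_client', 1)."""
    client, day, _rest = line.split(",", 2)
    yield f"{day}_{client}", 1

def reduce_job1(key, values):
    """Reducer: count the records for one day/client pair, then decompose the key."""
    day, client = key.split("_", 1)
    yield f"{day},{client},{sum(values)}"

def run_job1(lines):
    """Simulate the shuffle: group mapper output by key, then reduce in key order."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_job1(line):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_job1(key, groups[key]))
    return out

# Hypothetical input lines.
sample = [
    "alice,20160515,GET /a",
    "bob,20160515,GET /b",
    "alice,20160515,GET /c",
    "bob,20160516,GET /d",
]
job1_output = run_job1(sample)
print(job1_output)
```

In a real cluster the grouping in `run_job1` is done by the framework; only the mapper and reducer logic would be written by hand.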

Then there are options:

- Load the files in memory, if that fits, and determine the max per day.
- Load the files into a database and do SQL selects.
- Run another MR job:

identity mapper: YYYYMMDD,client,<# requests> -> YYYYMMDD (key) + client,#requests

Then, in the reducer, you receive all clients for a single day. Keep track of the highest number of requests seen so far and the client it belongs to; when the date changes or you reach the end of the input, output the client id that had the highest number for that day.
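
A minimal Python sketch of this second job, again simulating the shuffle in memory rather than using Hadoop, might look as follows. The function names (`map_job2`, `reduce_job2`, `run_job2`) and the sample records are hypothetical; the input is assumed to be the `YYYYMMDD,client,<# requests>` lines produced by the first job. On a tie, this sketch keeps the first client it sees.

```python
from collections import defaultdict

def map_job2(line):
    """Mapper: 'YYYYMMDD,client,count' -> (day, (client, count))."""
    day, client, count = line.split(",")
    yield day, (client, int(count))

def reduce_job2(day, values):
    """Reducer: track the running maximum and emit the top client for this day."""
    best_client, best_count = None, -1
    for client, count in values:
        if count > best_count:
            best_client, best_count = client, count
    yield f"{day},{best_client},{best_count}"

def run_job2(lines):
    """Simulate the shuffle: group by day, then reduce each day in sorted order."""
    groups = defaultdict(list)
    for line in lines:
        for day, value in map_job2(line):
            groups[day].append(value)
    out = []
    for day in sorted(groups):
        out.extend(reduce_job2(day, groups[day]))
    return out

# Hypothetical output of the first job.
counts = ["20160515,alice,2", "20160515,bob,1", "20160516,bob,1"]
top_clients = run_job2(counts)
print(top_clients)
```

Since each reducer call only sees one day, the "date changes" check from the text reduces here to one running maximum per call.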

Personally, I think the easiest option is to use BigQuery on Google Cloud Platform. You can load your (gzipped) file into a simple schema and run a query against it:

select date, client, num_requests
from (
  select date, client, num_requests,
         row_number() over (partition by date order by num_requests desc) as rn
  from (
    select date, client, count(client) as num_requests
    from my_table
    group by date, client
  ) as T
) as ranked
where rn = 1
radialmind