Does that mean the output of the first MapReduce job works as the input of the second MapReduce job? So there are two MapReduce jobs in total? And if the input is a sequence of couples (client, date) and the output is (date, client, max_requests), how do I use a pipeline of two MapReduce jobs to find the client with the most requests for each day?
1 Answer
MR is essentially a way to re-emit a dataset sorted by a different key; the reduce function can then aggregate to the final results.
In your case, the Mapper would map the input data, which consists of records of the type you describe:
<client>,<date>,other_data
and map this to:
<date>_<client>,other_data
It's probably easiest to generate a composite key. Rather than creating a custom key class that implements the K,V interfaces, you can map this to a string key that sorts correctly per day and per client:
YYYYMMDD_<client>
This guarantees that all records per day per client are processed by the same reducer. You can then simply count the number of records and output that as a new record for that day. Then decompose the original key and emit new records that look like:
YYYYMMDD,client,<# requests>
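A minimal sketch of that first job in Hadoop's Java MapReduce API, assuming comma-separated input lines of the form <client>,<date>,other_data with the date already formatted as YYYYMMDD; the class and field names here are illustrative, not part of any existing codebase:

    // Job 1: count requests per (day, client) using a composite string key.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RequestCount {

        public static class CompositeKeyMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed input line: <client>,<date>,other_data
                String[] fields = line.toString().split(",");
                String client = fields[0];
                String date = fields[1];              // YYYYMMDD
                outKey.set(date + "_" + client);      // composite key: per day, per client
                context.write(outKey, ONE);
            }
        }

        public static class CountReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            private final Text outKey = new Text();
            private final LongWritable outValue = new LongWritable();

            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                // Decompose the composite key back into date and client.
                String[] parts = key.toString().split("_", 2);
                outKey.set(parts[0] + "," + parts[1]);   // emits key: YYYYMMDD,client
                outValue.set(total);                      // value: # requests
                context.write(outKey, outValue);
            }
        }
    }

With the default text output format, each output line then looks like YYYYMMDD,client followed by a tab and the request count.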
Then there are options:
- Load the files in memory, if they fit, and determine the max per day.
- Load the files into a database and do SQL selects.
- Run another MR job:
identity mapper: YYYYMMDD,client,<# requests> -> YYYYMMDD (key) + client,#requests
Then in the reducer you get all clients for a single day. You simply keep track, per day, of the highest # requests and which client it belongs to; when the date changes or you reach the end of the input, output the client id that had the highest number (see the sketch below).
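A sketch of that second job, assuming its input is the text output of the counting job above, i.e. lines of the form YYYYMMDD,client followed by a tab and the request count; again, the class names are illustrative:

    // Job 2: for each day, pick the client with the most requests.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxRequestsPerDay {

        public static class ReKeyByDayMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private final Text day = new Text();
            private final Text clientAndCount = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed input line: YYYYMMDD,client<TAB><# requests>
                String[] keyAndValue = line.toString().split("\t");
                String[] dayAndClient = keyAndValue[0].split(",", 2);
                day.set(dayAndClient[0]);                                    // key: YYYYMMDD
                clientAndCount.set(dayAndClient[1] + "," + keyAndValue[1]);  // value: client,count
                context.write(day, clientAndCount);
            }
        }

        public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
            private final Text result = new Text();

            @Override
            protected void reduce(Text day, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String bestClient = null;
                long bestCount = -1;
                // All clients for one day arrive in this single reduce call,
                // so keeping a running maximum is enough.
                for (Text value : values) {
                    String[] clientAndCount = value.toString().split(",", 2);
                    long count = Long.parseLong(clientAndCount[1]);
                    if (count > bestCount) {
                        bestCount = count;
                        bestClient = clientAndCount[0];
                    }
                }
                result.set(bestClient + "," + bestCount);
                context.write(day, result);   // emits: YYYYMMDD <TAB> client,max_requests
            }
        }
    }

Chaining the two jobs is then just a matter of a driver that runs job 1 to completion (e.g. with job.waitForCompletion(true)) and points job 2's input path at job 1's output directory.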
Personally I think the easiest is to use BigQuery from the Google Cloud Platform. You can load your file (gzipped) into a simple schema and run a BQ statement against it:
select date, client, num_requests
from (
  select date, client, num_requests,
         row_number() over (partition by date order by num_requests desc) as rn
  from (
    select date, client, count(client) as num_requests
    from my_table
    group by date, client
  ) as T
)
where rn = 1
