Does that mean the output of the first MapReduce job works as the input of the second MapReduce job? So there are two MapReduce jobs in total? And if the input is a sequence of couples (client, date) and the output is (date, client, max_requests), how do I use a pipeline of two MapReduce jobs to find the client with the most requests for each day?
1 Answer
MR is essentially a way to re-emit a dataset sorted by a different key; the reduce function can then aggregate to the final results.
In your case, the Mapper would map the input data, which consists of records of the type you describe:
<client>,<date>,other_data
and map this to:
<date>_<client>,other_data
It's probably easiest to generate a composite key. Rather than creating a custom key class that implements the K,V interfaces, you can map this to a string key that sorts correctly per day and per client:
YYYYMMDD_<client>
This guarantees that all records per day per client are processed by the same reducer. You can then simply count the number of records and output that as a new record for that day. Then decompose the original key and emit new records that look like:
YYYYMMDD,client,<# requests>
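A minimal sketch of that first job in Hadoop's Java MapReduce API, assuming comma-separated input lines of the form <client>,<date>,other_data with the date already formatted as YYYYMMDD; the class and field names here are illustrative, not part of any existing codebase:

    // Job 1: count requests per (day, client) using a composite string key.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RequestCount {

        public static class CompositeKeyMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed input line: <client>,<date>,other_data
                String[] fields = line.toString().split(",");
                String client = fields[0];
                String date = fields[1];              // YYYYMMDD
                outKey.set(date + "_" + client);      // composite key: per day, per client
                context.write(outKey, ONE);
            }
        }

        public static class CountReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            private final Text outKey = new Text();
            private final LongWritable outValue = new LongWritable();

            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                // Decompose the composite key back into date and client.
                String[] parts = key.toString().split("_", 2);
                outKey.set(parts[0] + "," + parts[1]);   // emits key: YYYYMMDD,client
                outValue.set(total);                      // value: # requests
                context.write(outKey, outValue);
            }
        }
    }

With the default text output format, each output line then looks like YYYYMMDD,client followed by a tab and the request count.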
Then there are options:
- Load the files in memory, if they fit, and determine the max per day.
- Load the files into a database and do SQL selects.
- Run another MR job:
identity mapper: YYYYMMDD,client,<# requests> -> YYYYMMDD (key) + client,#requests
Then in the reducer you get all clients for a single day. You simply keep track, per day, of the highest # requests and which client it belongs to; when the date changes or you reach the end of the input, output the client id that had the highest number (see the sketch below).
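A sketch of that second job, assuming its input is the text output of the counting job above, i.e. lines of the form YYYYMMDD,client followed by a tab and the request count; again, the class names are illustrative:

    // Job 2: for each day, pick the client with the most requests.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxRequestsPerDay {

        public static class ReKeyByDayMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            private final Text day = new Text();
            private final Text clientAndCount = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed input line: YYYYMMDD,client<TAB><# requests>
                String[] keyAndValue = line.toString().split("\t");
                String[] dayAndClient = keyAndValue[0].split(",", 2);
                day.set(dayAndClient[0]);                                    // key: YYYYMMDD
                clientAndCount.set(dayAndClient[1] + "," + keyAndValue[1]);  // value: client,count
                context.write(day, clientAndCount);
            }
        }

        public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
            private final Text result = new Text();

            @Override
            protected void reduce(Text day, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String bestClient = null;
                long bestCount = -1;
                // All clients for one day arrive in this single reduce call,
                // so keeping a running maximum is enough.
                for (Text value : values) {
                    String[] clientAndCount = value.toString().split(",", 2);
                    long count = Long.parseLong(clientAndCount[1]);
                    if (count > bestCount) {
                        bestCount = count;
                        bestClient = clientAndCount[0];
                    }
                }
                result.set(bestClient + "," + bestCount);
                context.write(day, result);   // emits: YYYYMMDD <TAB> client,max_requests
            }
        }
    }

Chaining the two jobs is then just a matter of a driver that runs job 1 to completion (e.g. with job.waitForCompletion(true)) and points job 2's input path at job 1's output directory.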
Personally I think the easiest is to use BigQuery from the Google Cloud Platform. You can load your file (gzipped) into a simple schema and run a BQ statement against it:
select date, client, num_requests
from (
  select date, client, num_requests,
         row_number() over (partition by date order by num_requests desc) as rn
  from (
    select date, client, count(client) as num_requests
    from my_table
    group by date, client
  ) as T
)
where rn = 1
