
In my reducer, I require the total number of "lines" of input that were processed by the mappers.

sample input:

  • line,1,of,input
  • line,2,of,input
  • line,3,of,input

So, in all of the Reducers, I need to have access to the whatever was emitted by the Mappers plus the total number of lines (in this case 3).

I'm assuming that I will need either multiple jobs or chain together some mappers and/or reducers but I'm unsure of the proper way.

Note: This is not a simple average program, so I can't just have a single key from the mapper.

Brandon Bil
3 Answers


In your driver, after the job completes, use job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue() to get the total number of input records.
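Since counters are only final once the job has finished, one way to make this total visible to reducers is a two-job chain: the driver reads the counter from the first job and passes it to the second job through its Configuration. A sketch assuming the new (org.apache.hadoop.mapreduce) API; the job names and the "total.input.records" key are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class Driver {
    public static void main(String[] args) throws Exception {
        // First pass: any job whose mappers read the full input.
        Job firstJob = Job.getInstance(new Configuration(), "first pass");
        // ... set mapper, reducer, input/output paths ...
        firstJob.waitForCompletion(true);

        // Total number of records read by all mappers of the first job.
        long totalLines = firstJob.getCounters()
                .findCounter(TaskCounter.MAP_INPUT_RECORDS)
                .getValue();

        // Hand the total to the second job; its reducers can read it in
        // setup() via context.getConfiguration().getLong("total.input.records", -1).
        Configuration conf = new Configuration();
        conf.setLong("total.input.records", totalLines);
        Job secondJob = Job.getInstance(conf, "second pass");
        // ... configure and run secondJob ...
    }
}
```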

disco crazy

What you need here is a counter: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/Counters.html

Hadoop predefines a set of standard counters (including the number of lines processed by the mappers, which may be what you're looking for), but you can also define your own custom counters. Here's a sample of how to do it: Accessing a mapper's counter from a reducer
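A custom counter is just an enum that tasks increment through their context; Hadoop aggregates the values across all tasks. A minimal sketch (the class, enum, and counter names below are illustrative, not from the question):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCountingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // Custom counter: the enum name becomes the counter group,
    // the constant becomes the counter name.
    public enum MyCounters { INPUT_LINES }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count one per input line, aggregated across all mapper tasks.
        context.getCounter(MyCounters.INPUT_LINES).increment(1);
        // ... emit whatever key/value pairs your job actually needs ...
    }
}
```

After the job finishes, the driver can read it back with job.getCounters().findCounter(MyCounters.INPUT_LINES).getValue(), just like the predefined counters.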

Andrea Iacono

Assuming that you have not specified a custom Record Reader, you just need to get the value of the counter MAP_INPUT_RECORDS in the setup or configure method of your reducer (depending on whether you use the new or the old API, respectively).

See this post and this post for instructions on both APIs.
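For the new API, those posts describe looking the running job up from within the reducer and reading the map-phase counter there. A sketch under that assumption (class and field names are illustrative; whether the counter is visible this way depends on your Hadoop version and cluster setup):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskCounter;

public class TotalAwareReducer extends Reducer<Text, Text, Text, Text> {

    private long totalInputRecords;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Look up the job this reducer belongs to and read the counter
        // that the (already finished) map phase accumulated.
        Cluster cluster = new Cluster(context.getConfiguration());
        Job job = cluster.getJob(context.getJobID());
        totalInputRecords = job.getCounters()
                .findCounter(TaskCounter.MAP_INPUT_RECORDS)
                .getValue();
    }

    // ... reduce() can now use totalInputRecords alongside the mapper output ...
}
```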

vefthym