
In my reducer, I require the total number of "lines" of input that were processed by the mappers.

sample input:

  • line,1,of,input
  • line,2,of,input
  • line,3,of,input

So, in all of the Reducers, I need to have access to the whatever was emitted by the Mappers plus the total number of lines (in this case 3).

I'm assuming that I will need either multiple jobs or chain together some mappers and/or reducers but I'm unsure of the proper way.

Note: This is not a simple average program, so I can't just have a single key from the mapper.

Brandon Bil
3 Answers


In your driver, after the job completes, use job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue() to get the total number of input records.
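Since counters are only final once the job has finished, one way to make this total visible to reducers is a two-job chain: the driver reads the counter from the first job and passes it to the second job through its Configuration. A sketch assuming the new (org.apache.hadoop.mapreduce) API; the job names and the "total.input.records" key are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class Driver {
    public static void main(String[] args) throws Exception {
        // First pass: any job whose mappers read the full input.
        Job firstJob = Job.getInstance(new Configuration(), "first pass");
        // ... set mapper, reducer, input/output paths ...
        firstJob.waitForCompletion(true);

        // Total number of records read by all mappers of the first job.
        long totalLines = firstJob.getCounters()
                .findCounter(TaskCounter.MAP_INPUT_RECORDS)
                .getValue();

        // Hand the total to the second job; its reducers can read it in
        // setup() via context.getConfiguration().getLong("total.input.records", -1).
        Configuration conf = new Configuration();
        conf.setLong("total.input.records", totalLines);
        Job secondJob = Job.getInstance(conf, "second pass");
        // ... configure and run secondJob ...
    }
}
```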

disco crazy

What you need here is a counter: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/Counters.html

Hadoop predefines a set of standard counters (including the number of lines processed by the mappers, which may be what you're looking for), but you can also define your own custom counters. Here's a sample of how to do it: Accessing a mapper's counter from a reducer
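A custom counter is just an enum that tasks increment through their context; Hadoop aggregates the values across all tasks. A minimal sketch (the class, enum, and counter names below are illustrative, not from the question):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCountingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    // Custom counter: the enum name becomes the counter group,
    // the constant becomes the counter name.
    public enum MyCounters { INPUT_LINES }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count one per input line, aggregated across all mapper tasks.
        context.getCounter(MyCounters.INPUT_LINES).increment(1);
        // ... emit whatever key/value pairs your job actually needs ...
    }
}
```

After the job finishes, the driver can read it back with job.getCounters().findCounter(MyCounters.INPUT_LINES).getValue(), just like the predefined counters.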

Andrea Iacono

Assuming that you have not specified a custom Record Reader, you just need to get the value of the counter MAP_INPUT_RECORDS in the setup or configure method of your reducer (depending on whether you use the new or the old API, respectively).

See this post and this post for instructions on both APIs.
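For the new API, those posts describe looking the running job up from within the reducer and reading the map-phase counter there. A sketch under that assumption (class and field names are illustrative; whether the counter is visible this way depends on your Hadoop version and cluster setup):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskCounter;

public class TotalAwareReducer extends Reducer<Text, Text, Text, Text> {

    private long totalInputRecords;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Look up the job this reducer belongs to and read the counter
        // that the (already finished) map phase accumulated.
        Cluster cluster = new Cluster(context.getConfiguration());
        Job job = cluster.getJob(context.getJobID());
        totalInputRecords = job.getCounters()
                .findCounter(TaskCounter.MAP_INPUT_RECORDS)
                .getValue();
    }

    // ... reduce() can now use totalInputRecords alongside the mapper output ...
}
```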

vefthym