
I am reading input through stdin (hadoop streaming in reducer).

I need to detect when the last record comes in. I am looping over the stdin data with a for loop.

I tried reading stdin once to calculate the total record count and then reading it again for the business processing. But as soon as I read a record from stdin to calculate total_cnt, that record is consumed from the stream, so when I later read stdin for processing there are no records left.

total_cnt = 0 

for line in stdin:  
    total_cnt += 1

for line in stdin:  
   ##Some Processing##

I don't want to store the stdin data somewhere and read it from that location twice (1. total record count and 2. data processing).

Is there any way I can detect when the last record comes in from stdin?

I am using Python 2.7.11 and need to implement this approach in a Hadoop reducer.

TemporalWolf
Shantanu Sharma

1 Answer


Process the previous line each time you take in a new one. When the loop exits, line will still hold the last, unprocessed line for you to do with as you please.

Example:

old_line = None
for line in range(10):
    if old_line is None:
        old_line = line
        continue  # skip processing on the first loop: we'll make it up after
    print "Do stuff with: %i" % old_line
    old_line = line
print "Double last line: %i" % (line*2)

which gives:

Do stuff with: 0
Do stuff with: 1
Do stuff with: 2
Do stuff with: 3
Do stuff with: 4
Do stuff with: 5
Do stuff with: 6
Do stuff with: 7
Do stuff with: 8
Double last line: 18
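Applied to the reducer itself, the same pattern might look like the following sketch. It wraps the look-ahead logic in a generator that flags the final record; the name `tag_last` and the idea of yielding `(record, is_last)` pairs are my own illustration, not part of the original answer, and the stdin usage is only shown in a comment:

```python
import sys

def tag_last(lines):
    """Yield (record, is_last) pairs, marking the final record with True."""
    prev = None
    for line in lines:
        if prev is not None:
            yield prev, False  # a record we know is not the last one
        prev = line
    if prev is not None:
        yield prev, True  # loop is done, so prev is the last record

# In a Hadoop streaming reducer you would iterate over sys.stdin, e.g.:
# for record, is_last in tag_last(sys.stdin):
#     ...normal processing, plus special handling when is_last is True...
```

Because the generator only keeps one line in memory at a time, this avoids both the double read of stdin and buffering the whole input.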
TemporalWolf