I have a piece of code that reads from two large files using generators and stops when EOF is reached in either of the two files. I'd like to know (1) which generator reached EOF first, (2) the progress of each generator, i.e. the value of i in the generators (see code below), at the moment the first generator reaches EOF, and (3) the number of lines remaining in the other generator. I do not know ahead of time how long each file is, and would like to avoid pre-scanning the files.
I know I can get the progress by either:
- incrementing a counter every time I call next() (this is ugly!), or
- letting the generator return a counter (see counter1 and counter2 in the code),
but in both cases, I won't know which of gen1 or gen2 reached EOF.
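For illustration, the counter idea can be combined with an "exhausted" flag in a small wrapper, so that the caller can ask afterwards which side hit EOF. This is only a sketch under assumptions: TrackedIterator and its name, count and exhausted attributes are hypothetical, and plain lists stand in for the two files.

```python
class TrackedIterator(object):
    """Wrap an iterator; remember how many items it yielded and whether it ran out."""

    def __init__(self, iterable, name):
        self._it = iter(iterable)
        self.name = name
        self.count = 0          # number of items handed out so far
        self.exhausted = False  # set once the wrapped iterator raises StopIteration

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = next(self._it)
        except StopIteration:
            self.exhausted = True
            raise
        self.count += 1
        return item

    next = __next__  # Python 2 spelling of the same method


# Hypothetical stand-ins for the two files.
t1 = TrackedIterator([1, 2, 3], "gen1")
t2 = TrackedIterator([1.0, 2.0, 3.0, 4.0, 5.0], "gen2")

try:
    while True:
        next(t1)
        next(t2)
except StopIteration:
    pass

first = t1 if t1.exhausted else t2
other = t2 if first is t1 else t1
progress_at_stop = other.count        # where the other side was when the first one ended
remaining = sum(1 for _ in other)     # drain the other side to count what is left
```

Because the flag lives on the wrapper rather than in the exception, the consuming loop does not need to carry any counters of its own.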
I also figured out I can add a 'message' to the StopIteration exception, but I was wondering if there is a better way. After the first try...except block, can I somehow figure out which generator has not reached EOF yet and advance it? (I tried using close() or throw() on the generator, or the finally clause inside the generator, but didn't really understand them.)
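As a small aside on close() and finally, the sketch below shows how the two interact. The names numbered_lines and log are hypothetical, and a short list stands in for a file.

```python
def numbered_lines(lines, log):
    """Yield (index, line) pairs; note in log when the generator is finalised."""
    try:
        for i, line in enumerate(lines):
            yield i, line
    finally:
        # Runs on normal exhaustion and also when close() is called early.
        log.append("finalised")


log = []
g = numbered_lines(["a", "b", "c"], log)
first = next(g)      # the generator is now paused at the yield
g.close()            # raises GeneratorExit at the paused yield, so finally runs
leftover = list(g)   # a closed generator produces nothing more
```

Note that close() abandons the remaining lines rather than counting them, so on its own it would not say how many lines were left in the other file.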
def gen1(fp):
    for i, line in enumerate(fp):
        int_val = process_line(line)
        yield int_val, i
    # Python 2 syntax: the tuple becomes the exception's args.
    # i is the 0-based index of the last line read.
    raise StopIteration, ("gen1", i)

def gen2(fp):
    for i, line in enumerate(fp):
        float_val = process_line_some_other_way(line)
        yield float_val, i
    raise StopIteration, ("gen2", i)

g1 = gen1(open('large_file', 'r'))
g2 = gen2(open('another_large_file', 'r'))
progress = 0  # initialised here so that the increment below does not raise NameError
try:
    val1, counter1 = next(g1)
    val2, counter2 = next(g2)
    progress += 1
    while True:  # actual code is a bit more complicated than shown here
        while val1 > val2:
            val2, counter2 = next(g2)
        while val1 < val2:
            val1, counter1 = next(g1)
        if val1 == val2:
            do_something()
            val1, counter1 = next(g1)
            val2, counter2 = next(g2)
except StopIteration as err:
    first_gen_name, first_num_lines = err.args
    gen1_finished_first = first_gen_name == 'gen1'

# Go through the rest of the other generator to get the total number of lines
the_remaining_generator = g2 if gen1_finished_first else g1
try:
    while True:
        next(the_remaining_generator)
except StopIteration as err:
    second_gen_name, second_num_lines = err.args

if gen1_finished_first:
    print 'gen1 finished first, it had {} lines.'.format(first_num_lines)  # same as `counter1`
    print 'gen2 was at line {} when gen1 finished.'.format(counter2)
    print 'gen2 had {} lines total.'.format(second_num_lines)
else:
    ...  # omitted
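
For comparison, here is a sketch of the same "message on StopIteration" idea for Python 3, where the code above would not run as written: the two-argument raise form was removed, and since Python 3.7 (PEP 479) a StopIteration raised inside a generator body is converted to RuntimeError. A generator can instead return a value, which arrives as the value attribute of the StopIteration (Python 3.3+). The names counted and drain are hypothetical, and lists stand in for the files.

```python
def counted(lines, name):
    """Yield (line, line_number) pairs, then report (name, total) via return."""
    n = 0  # stays 0 if the input is empty
    for n, line in enumerate(lines, 1):
        yield line, n
    return name, n  # becomes StopIteration.value


def drain(gen):
    """Exhaust gen and return the value attached to its StopIteration."""
    while True:
        try:
            next(gen)
        except StopIteration as stop:
            return stop.value


g1 = counted(["1", "2"], "gen1")
g2 = counted(["1", "2", "3", "4"], "gen2")

c1 = c2 = 0
try:
    while True:
        _, c1 = next(g1)
        _, c2 = next(g2)
except StopIteration as stop:
    first_name, first_total = stop.value

other = g2 if first_name == "gen1" else g1
other_progress = c2 if first_name == "gen1" else c1
second_name, second_total = drain(other)
remaining = second_total - other_progress
```

Starting enumerate at 1 makes the returned number a line count rather than the index of the last line, which avoids an off-by-one when reporting totals.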