
I have a piece of code that reads from two large files using generators and stops when one of them reaches EOF. I'd like to know (1) which generator reached EOF first, (2) the progress of each generator, i.e. the value of i in the generators (see code below) when the first generator reaches EOF, and (3) the number of lines remaining in the other generator. I don't know ahead of time how long each file is, and I'd like to avoid pre-scanning the files.

I know I can get the progress by:

  1. increment a counter every time I call next() (this is ugly!), or

  2. let the generator return a counter (see counter1 and counter2 in code),

but in both cases, I won't know which of gen1 or gen2 reached EOF.

I also figured out that I can attach a 'message' to the StopIteration exception, but I was wondering if there is a better way. After the first try...except block, can I somehow figure out which generator has not reached EOF yet and advance it? (I tried using close() or throw() on the generator, and the finally clause inside the generator, but didn't really understand them.)

def gen1(fp):
    for i, line in enumerate(fp):
        int_val = process_line(line)
        yield int_val, i
    raise StopIteration, ("gen1", i)

def gen2(fp):
    for i, line in enumerate(fp):
        float_val = process_line_some_other_way(line)
        yield float_val, i
    raise StopIteration, ("gen2", i)

g1 = gen1(open('large_file', 'r'))
g2 = gen2(open('another_large_file', 'r'))

try:
    val1, counter1 = next(g1)
    val2, counter2 = next(g2)
    while True:  # actual code is a bit more complicated than shown here
        while val1 > val2:
            val2, counter2 = next(g2)
        while val1 < val2:
            val1, counter1 = next(g1)
        if val1 == val2:
            do_something()
            val1, counter1 = next(g1)
            val2, counter2 = next(g2)

except StopIteration as err:
    first_gen_name, first_num_lines = err.args

gen1_finished_first = first_gen_name == 'gen1'

# Go through the rest of the other generator to get the total number of lines
the_remaining_generator = g2 if gen1_finished_first else g1
try:
    while True:
        next(the_remaining_generator)
except StopIteration as err:
    second_gen_name, second_num_lines = err.args

if gen1_finished_first:
    print 'gen1 finished first, it had {} lines.'.format(first_num_lines) # same as `counter1`
    print 'gen2 was at line {} when gen1 finished.'.format(counter2)
    print 'gen2 had {} lines total.'.format(second_num_lines)
else:
    ... # omitted
obk
  • Do you have to use generators? If not, you can define a class that is iterable, that knows how to return the next line from a file, that keeps track of bytes consumed relative to total file size, that keeps track of whether the opened file handle is exhausted, and that therefore can report progress at any time. – FMc Jul 03 '14 at 02:01
  • Instead of incrementing a counter when you `next` or building a counter into the generator, why not just `enumerate` the generator? – user2357112 Jul 03 '14 at 02:05
  • @user2357112 I'm not sure I understand what you mean. Note that I advance (call `next` on) the generator on multiple lines within the `while` loop. – obk Jul 03 '14 at 17:01
  • @obk: You can wrap the generator in an `enumerate` iterator and then just `next` that. With a `gen1` that doesn't have a counter built in, it'd look like `g1 = enumerate(gen1(whatever))` and then `i, val = next(g1)`. – user2357112 Jul 03 '14 at 17:33
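
For reference, a minimal sketch of that last suggestion, assuming a version of gen1 that yields only the processed value (no built-in counter):

def gen1(fp):
    # counter removed: the enumerate wrapper supplies it
    for line in fp:
        yield process_line(line)

g1 = enumerate(gen1(open('large_file', 'r')))
counter1, val1 = next(g1)   # counter1 is the 0-based index of the line just read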

2 Answers


I think you may want to use an iterator class instead -- it's implemented with a standard Python class and can have whatever extra attributes you need (such as an exhausted flag).

Something like the following:

# untested
class file_iter():
    def __init__(self, file_name):
        self.file = open(file_name)
        self.counted_lines = 0
        self.exhausted = False
    def __iter__(self):
        return self
    def __next__(self):
        if self.exhausted:
            raise StopIteration
        try:
            next_line = self.file.readline()
            self.counted_lines += 1
            return next_line
        except EOFError:
            self.file.close()
            self.exhausted = True
            raise StopIteration
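
A rough usage sketch (assuming Python 3, since the class defines __next__, and assuming the EOF detection works as intended — see the comment below):

f1 = file_iter('large_file')
f2 = file_iter('another_large_file')
try:
    while True:
        # stand-in for the question's merge logic
        line1 = next(f1)
        line2 = next(f2)
except StopIteration:
    pass

# Each iterator now reports its own state:
print(f1.exhausted, f1.counted_lines)
print(f2.exhausted, f2.counted_lines)
gen1_finished_first = f1.exhausted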
Ethan Furman
  • Yep, I think a class is a perfect solution here. – kindall Jul 03 '14 at 03:14
  • Thanks for this, it cleaned up my code greatly. One thing though: I needed to manually check for `if len(next_line) == 0` instead of `EOFError`. See this [post](http://stackoverflow.com/a/15599780/3294994). – obk Aug 01 '14 at 22:26
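
For completeness, a sketch of __next__ with that empty-string check folded in (untested, like the original):

    def __next__(self):
        if self.exhausted:
            raise StopIteration
        next_line = self.file.readline()
        if len(next_line) == 0:   # readline() returns '' at EOF rather than raising EOFError
            self.file.close()
            self.exhausted = True
            raise StopIteration
        self.counted_lines += 1
        return next_line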

You can use chain to tack a special EOF value onto the end of your generator, e.g.:

from itertools import chain
EOF = object()
fin = open('somefile')
src = enumerate(chain(fin, [EOF]))
while True:
    idx, row = next(src)
    if row == EOF:
        break  # End of file
    print idx, row
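
Applied to the question's two generators, a sketch might look like this (assuming gen1 and gen2 yield only the processed values, with enumerate supplying the counters):

g1 = enumerate(chain(gen1(open('large_file', 'r')), [EOF]))
g2 = enumerate(chain(gen2(open('another_large_file', 'r')), [EOF]))

counter1, val1 = next(g1)
counter2, val2 = next(g2)
while val1 is not EOF and val2 is not EOF:
    # ... the question's merge logic goes here, calling next(g1) / next(g2) ...
    counter1, val1 = next(g1)
    counter2, val2 = next(g2)

# when val1 is EOF, counter1 equals the number of lines gen1 produced
gen1_finished_first = val1 is EOF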

You may also be able to use izip_longest. Replace f1 and f2 with your generators:

from itertools import count, izip_longest
EOF = object()
with open('f1') as f1, open('f2') as f2:
    for i, r1, r2 in izip_longest(count(), f1, f2, fillvalue=EOF):
        if EOF in (r1, r2):
            print i, r1, r2
            break
John La Rooy