
I have this function in my code:

import re

def load_fasta(filename):
    f = open(filename)
    return (seq.group(0) for seq in re.finditer(r">[^>]*", f.read()))

This will leave the file open indefinitely, which isn't good practice. How do I close the file when the generator is exhausted? I guess I could expand the generator expression into a for loop with yield statements and then close the file afterwards. I'm trying to use functional programming as often as possible, though (just as a learning exercise). Is there a different way to do this?

Colin

1 Answer


Use yield instead of a single generator expression.

import re

def load_fasta(filename):
    with open(filename) as f:
        for seq in re.finditer(r">[^>]*", f.read()):
            yield seq.group(0)

for thing in load_fasta(filename):
    ...

The with statement will close the file once the for loop completes. Note that since you read the entire file into memory anyway, you could simply use

def load_fasta(filename):
    with open(filename) as f:
        data = f.read()
    for seq in re.finditer(r">[^>]*", data):
        yield seq.group(0)
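
One caveat, not part of the original answer: the `with` block only exits once the generator is exhausted or explicitly closed, so if the consumer breaks out of the loop early, the file stays open until the generator happens to be garbage-collected. A small sketch of how to guarantee prompt cleanup in that case, using `contextlib.closing` (the early-exit condition here is purely illustrative):

from contextlib import closing

# closing() calls the generator's close() method on exit, which raises
# GeneratorExit at the paused yield inside load_fasta and lets its `with`
# block release the file, even if we stop iterating early.
with closing(load_fasta(filename)) as records:
    for record in records:
        if record.startswith(">stop_here"):  # hypothetical early exit
            break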
chepner
  • I think you should either use `f.read()` or loop over the lines and pass them to `finditer()`. Note that with `f.read()` you are loading the whole file into memory anyway, so the generator gains you nothing. – Mazdak Oct 03 '16 at 18:43
  • Yeah, I wrote this up hoping that `finditer` could use an iterator itself instead of a string. It can't, so this isn't really any better than just returning a list of matches. – chepner Oct 03 '16 at 18:43
  • Is there a way to do this without loading everything into memory? – Colin Oct 03 '16 at 19:03
  • Not that I am aware of. The problem is that, in general, the regex engine may need to do backtracking, which means if it were reading directly from a generator, it might need to go backwards. Someone may have written some sort of `onlinefinditer` that does its own buffering to *minimize* the amount of data kept in memory, but there isn't anything like that in the standard library. – chepner Oct 03 '16 at 19:07
  • Is there a way to do it without regex? Since all I really need to do is split the file at '>' characters. – Colin Oct 03 '16 at 19:11
  • As far as I know, you would need to do your own buffering. (Reading a fixed-size chunk of bytes into a buffer and attempting to read a record from that.) http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python might have some useful tips. – chepner Oct 03 '16 at 19:39
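
Following up on the manual-buffering idea in the last comment, here is a minimal sketch that streams records without `re` and without reading the whole file: it reads fixed-size chunks and splits at `>` characters, keeping at most one record plus one chunk in memory. The function name, chunk size, and the assumption that the file begins with `>` (as valid FASTA does) are mine, not from the thread:

def load_fasta_lazy(filename, chunk_size=1 << 16):
    """Yield one '>'-delimited record at a time from a FASTA file,
    reading in fixed-size chunks instead of slurping the whole file."""
    with open(filename) as f:
        buffer = ""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            # Every record before the last '>' is complete; the tail may
            # still be growing, so keep it in the buffer.
            *complete, buffer = buffer.split(">")
            for record in complete:
                if record:  # skip the empty string before the leading '>'
                    yield ">" + record
        if buffer:  # the final record has no '>' after it
            yield ">" + buffer

The memory high-water mark is one record plus one chunk, which addresses the concern in the comments about `f.read()` loading everything at once.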