
It's a common idiom in Python to use a context manager to close files automatically:

with open('filename') as my_file:
    # do something with my_file

# my_file gets automatically closed after exiting 'with' block

Now I want to read the contents of several files. The consumer of the data does not know or care whether the data comes from files or from something else. It does not want to check whether the objects it receives can be opened. It just wants something to read lines from. So I create an iterator like this:

def select_files():
    """Yields carefully selected and ready-to-read-from files"""
    file_names = [.......]
    for fname in file_names:
        with open(fname) as my_open_file:
            yield my_open_file

This iterator may be used like this:

for file_obj in select_files():
    for line in file_obj:
        # do something useful

(Note that the same code could consume lists of strings just as well as open files - that's cool!)

The question is: is it safe to yield open files?

It looks like "why not?". The consumer calls the iterator; the iterator opens a file and yields it to the consumer. The consumer processes the file and comes back to the iterator for the next one. The iterator code resumes, we exit the 'with' block, the my_open_file object gets closed, we go to the next file, and so on.

But what if the consumer never comes back to the iterator for the next file? For example, an exception occurred inside the consumer. Or the consumer found something very exciting in one of the files and happily returned the results to whoever called it?

The iterator code would never resume in that case; we would never reach the end of the 'with' block, and the my_open_file object would never get closed!
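To make the worry concrete, here is a sketch of such a consumer (find_keyword and the keyword handling are hypothetical), which returns from inside the loop and never resumes the generator:

def find_keyword(keyword):
    # Hypothetical consumer: returns as soon as it finds a match,
    # leaving the generator suspended inside its 'with' block.
    for file_obj in select_files():
        for line in file_obj:
            if keyword in line:
                return line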

Or would it?

lesnik
  • The iterator would be cleaned up when it goes out of scope, which it should in the cases you mention. – J. P. Petersen Jan 26 '17 at 19:52
  • If you save a reference to the generator in the consumer (for instance, `producer=select_files()`) then you could use its `.throw` method to tell it to shut down. https://docs.python.org/3/reference/expressions.html#generator.throw – Terry Jan Reedy Jan 26 '17 at 20:05
  • @TerryJanReedy Generators have a `close` method which better serves the purpose of stopping a generator than throwing a random exception in there... – Bakuriu Jan 27 '17 at 07:06
  • Anyway, the same issue happens if you simply yield the contents of the file: `with open(...) as f: for line in f: yield line`. The consumer may not exhaust the generator, and hence the file may never be closed. This is an issue with "lazy I/O" in general. It's better to open files inside "eager" code and pass them to the lazy functions. – Bakuriu Jan 27 '17 at 07:09
  • While this doesn't directly address OP's question... An alternative way to handle this situation is to use [`fileinput`](https://docs.python.org/3.6/library/fileinput.html) (a sketch follows these comments). See also http://stackoverflow.com/questions/16095855/whats-the-most-pythonic-way-to-iterate-over-all-the-lines-of-multiple-files/16095960#16095960 – mgilson Jan 27 '17 at 07:14
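
A minimal sketch of the fileinput alternative mentioned in the last comment (assuming files named 'a', 'b', and 'c' exist):

import fileinput

# fileinput chains several files into one line-iterator and, used as a
# context manager, closes whichever file is currently open on exit.
with fileinput.input(files=('a', 'b', 'c')) as f:
    for line in f:
        print(line, end='')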

2 Answers


You bring up a criticism that has been raised before [1]. The cleanup in this case is non-deterministic, but it will happen in CPython when the generator gets garbage collected. Your mileage may vary in other Python implementations...

Here's a quick example:

from __future__ import print_function
import contextlib

@contextlib.contextmanager
def manager():
    """Easiest way to get a custom context manager..."""
    try:
        print('Entered')
        yield
    finally:
        print('Closed')


def gen():
    """Just a generator with a context manager inside.

    When the context is entered, we'll see "Entered" on the console
    and when exited, we'll see "Closed" on the console.
    """
    man = manager()
    with man:
        for i in range(10):
            yield i


# Test what happens when we consume a generator.
list(gen())

def fn():
    g = gen()
    next(g)
    # g.close()

# Test what happens when the generator gets garbage collected inside
# a function
print('Start of Function')
fn()
print('End of Function')

# Test what happens when a generator gets garbage collected outside
# a function.  IIRC, this isn't _guaranteed_ to happen in all cases.
g = gen()
next(g)
# g.close()
print('EOF')

Running this script in CPython, I get:

$ python ~/sandbox/cm.py
Entered
Closed
Start of Function
Entered
Closed
End of Function
Entered
EOF
Closed

Basically, what we see is that for generators that are exhausted, the context manager cleans up when you expect. For generators that aren't exhausted, the cleanup function runs when the generator is collected by the garbage collector. This happens when the generator goes out of scope (or, IIRC, at the next gc.collect cycle at the latest).
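
If you want to watch the garbage collector do the cleanup, here's a quick sketch reusing gen from the example above:

import gc

g = gen()      # create the generator
next(g)        # prints "Entered"
del g          # on CPython the refcount hits zero here and "Closed" prints
gc.collect()   # on other implementations, this nudges the collector along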

However, doing some quick experiments (e.g. running the above code in pypy), I don't get all of my context managers cleaned up:

$ pypy --version
Python 2.7.10 (f3ad1e1e1d62, Aug 28 2015, 09:36:42)
[PyPy 2.6.1 with GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
$ pypy ~/sandbox/cm.py
Entered
Closed
Start of Function
Entered
End of Function
Entered
EOF

So, the assertion that the context manager's __exit__ will get called in all Python implementations is untrue. The misses here are likely attributable to pypy's garbage collection strategy (which isn't reference counting); by the time pypy decides to reap the generators, the process is already shutting down, so it doesn't bother with them... In most real-world applications, the generators would probably get reaped and finalized quickly enough that it doesn't actually matter...


Providing strict guarantees

If you want to guarantee that your context manager is finalized properly, you should take care to close the generator when you are done with it [2]. Uncommenting the g.close() lines above gives me deterministic cleanup, because a GeneratorExit is raised at the yield statement (which is inside the context manager) and is then caught/suppressed by the generator...

$ pypy ~/sandbox/cm.py
Entered
Closed
Start of Function
Entered
Closed
End of Function
Entered
Closed
EOF

$ python3 ~/sandbox/cm.py
Entered
Closed
Start of Function
Entered
Closed
End of Function
Entered
Closed
EOF

$ python ~/sandbox/cm.py
Entered
Closed
Start of Function
Entered
Closed
End of Function
Entered
Closed
EOF
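
If you want to see exactly where the GeneratorExit lands, here's a minimal sketch (noisy is a hypothetical generator; note that it re-raises, since swallowing the exception and yielding again would make close() raise a RuntimeError):

def noisy():
    """Hypothetical generator that reports the GeneratorExit."""
    try:
        yield 1
    except GeneratorExit:
        print('GeneratorExit raised at the yield')
        raise  # re-raise so finalization completes normally

g = noisy()
next(g)     # advance to the yield
g.close()   # prints the message; close() itself returns quietly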

FWIW, this means that you can clean up your generators using contextlib.closing:

from contextlib import closing
with closing(gen_function()) as items:
    for item in items:
        pass # Do something useful!
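
Applied to the select_files generator from the question, that would look like the sketch below. The with statement closes the generator even if the loop body raises or returns early, and closing the generator in turn closes the currently open file:

from contextlib import closing

with closing(select_files()) as files:
    for file_obj in files:
        for line in file_obj:
            pass  # do something useful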

[1] Most recently, some discussion has revolved around PEP 533, which aims to make iterator cleanup more deterministic.
[2] It is perfectly OK to close an already closed and/or consumed generator, so you can call it without worrying about the state of the generator.
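
For instance, reusing gen from the example above:

g = gen()
list(g)     # exhaust the generator; prints "Entered" then "Closed"
g.close()   # closing an exhausted generator is a no-op
g.close()   # closing it twice is fine too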

mgilson
  • "The cleanup in this case is non-deterministic" - I am not sure I totally understand this statement. Does it mean that what happens depends on garbage-collector behaviour? – lesnik Jan 26 '17 at 20:06
  • @lesnik -- Yes, that is what it means. – mgilson Jan 26 '17 at 20:10
  • 1
    @lesnik -- I was thinking about this more tonight (maybe because it bothers me that I don't always clean these things up well in _my_ code ...). Anyway, it appears that there _is_ a way force the generators to clean up when you're done with them. I've re-written/updated the answer to explain how that is possible. – mgilson Jan 27 '17 at 07:09
  • Special thanks for bringing attention to PEP-533 - it's a big surprise for me that garbage collector is involved here! – lesnik Jan 27 '17 at 07:28
  • You mention that if iterator is exhausted "context manager cleans up (more or less) when you expect". Why "more or less"? Isn't situation straightforward in this case? – lesnik Jan 27 '17 at 07:30
  • @lesnik -- I think so, but there's always someone who thinks it should happen differently ;-). Still, I probably shouldn't have hedged my statement there... – mgilson Jan 27 '17 at 07:31

Is it safe to combine 'with' and 'yield' in python?

I don't think you should do this.

Let me demonstrate by making some files:

>>> for f in 'abc':
...     with open(f, 'w') as _: pass

Convince ourselves that the files are there:

>>> for f in 'abc': 
...     with open(f) as _: pass 

And here's a function that recreates your code:

def gen_abc():
    for f in 'abc':
        with open(f) as file:
            yield file

Here it looks like you can use the function - each f.closed is checked while the generator is still suspended inside its with block, so each file is open at that moment:

>>> [f.closed for f in gen_abc()]
[False, False, False]

But let's create a list comprehension of all of the file objects first:

>>> l = [f for f in gen_abc()]
>>> l
[<_io.TextIOWrapper name='a' mode='r' encoding='cp1252'>, <_io.TextIOWrapper name='b' mode='r' encoding='cp1252'>, <_io.TextIOWrapper name='c' mode='r' encoding='cp1252'>]

And now we see they are all closed:

>>> c = [f.closed for f in l]
>>> c
[True, True, True]

This only works until the generator closes. Then the files are all closed.

I doubt that is what you want: even if you're using lazy evaluation, your last file will probably be closed before you're done using it.
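
If you want deterministic cleanup back, one sketch in the spirit of the comments on the question is to yield the names and let the consumer own the with block:

def gen_abc_names():
    """Hypothetical variant: yield file names instead of open files."""
    for fname in 'abc':
        yield fname

for fname in gen_abc_names():
    with open(fname) as file:
        for line in file:
            pass  # the file is guaranteed closed when this block exits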

Russia Must Remove Putin