
I want to treat many files as if they were all one file. What's the proper pythonic way to take [filenames] => [file objects] => [lines] with generators/not reading an entire file into memory?

We all know the proper way to open a file:

with open("auth.log", "rb") as f:
    print(sum(1 for line in f))  # count lines without loading the whole file

And we know the correct way to link several iterators/generators into one long one:

>>> list(itertools.chain(range(3), range(3)))
[0, 1, 2, 0, 1, 2]

but how do I link multiple files together and preserve the context managers?

with open("auth.log", "rb") as f0:
    with open("auth.log.1", "rb") as f1:
        for line in itertools.chain(f0, f1):
            do_stuff_with(line)

    # f1 is now closed
# f0 is now closed
# gross

I could ignore the context managers and do something like this, but it doesn't feel right:

files = itertools.chain(*(open(f, "rb") for f in file_names))
for line in files:
    do_stuff_with(line)

Or is this kind of what Async IO - PEP 3156 is for and I'll just have to wait for the elegant syntax later?

Conrad.Dean
  • Also note that `files = itertools.chain(*(open(f, "rb") for f in file_names))` is definitely not good in this context. Unpacking the generator with `*` causes all of your files to be opened before you actually enter the `chain` constructor. You're better off with `itertools.chain.from_iterable(open(fname,'r') for fname in filenames)` -- In fact, this is a classic reason why the `from_iterable` classmethod needs to exist in the first place :). – mgilson Apr 19 '13 at 02:07
  • @mgilson had no idea `from_iterable` was a thing! I'm glad my usecase is a textbook example for why it's useful. I was trying to figure out how to properly get the lazy evaluation to work without nested for loops. Thanks! – Conrad.Dean Apr 19 '13 at 02:51
  • Note that even `from_iterable` doesn't guarantee that all of your files are closed when you're done iterating over it, because you never know when `__del__` will actually run (though I'm pretty sure that they will be in CPython)... – mgilson Apr 19 '13 at 02:53
  • There is [`contextlib.ExitStack`](http://docs.python.org/3.4/library/contextlib.html#contextlib.ExitStack), which lets you treat multiple context managers as one (it is not needed in your case but might be useful in related cases). – jfs Apr 22 '13 at 18:51
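
For reference, here is a minimal sketch of how `contextlib.ExitStack` (Python 3.3+) could replace the nested `with` blocks from the question. The name `chained_lines` and the throwaway demo files are assumptions made for illustration, not part of the thread. Note that this variant opens every file up front, so it may not suit very long file lists.

```python
import itertools
import os
import tempfile
from contextlib import ExitStack


def chained_lines(file_names):
    """Yield lines from all files; every file is closed when the generator finishes."""
    with ExitStack() as stack:
        # enter_context registers each file so the stack closes it on exit.
        files = [stack.enter_context(open(name, "r")) for name in file_names]
        yield from itertools.chain.from_iterable(files)


# Demo with two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["a\nb\n", "c\n"]):
        path = os.path.join(tmp, "log%d.txt" % i)
        with open(path, "w") as out:
            out.write(text)
        paths.append(path)
    demo_lines = list(chained_lines(paths))
```

Because the `with` block lives inside the generator, the files are closed once the generator is exhausted or explicitly closed.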

1 Answer


There's always fileinput.

for line in fileinput.input(filenames):
    ...

Reading the source, however, it appears that fileinput.FileInput can't be used as a context manager [1]. To fix that, you could use contextlib.closing, since FileInput instances have a sanely implemented close method:

import fileinput
from contextlib import closing

with closing(fileinput.input(filenames)) as line_iter:
    for line in line_iter:
        ...

An alternative that keeps the context manager is to write a simple function that loops over the files and yields lines as it goes:

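# Note: this name shadows the standard fileinput module if both are in scope.
def fileinput(files):
    for f in files:
        with open(f, 'r') as fin:  # each file is closed before the next one opens
            for line in fin:
                yield line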

No real need for itertools.chain here IMHO ... The magic here is in the yield statement which is used to transform an ordinary function into a fantastically lazy generator.
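
To illustrate the laziness, here is a small sketch under stated assumptions: the generator is renamed `iter_lines`, and the `opened` list is bookkeeping added purely for this demo. It shows that a file is only opened once iteration reaches it, and that closing the generator early still closes the file in use.

```python
import itertools
import os
import tempfile


def iter_lines(paths, opened):
    """Yield lines file by file, recording each path as it is opened."""
    for path in paths:
        with open(path, "r") as fin:
            opened.append(path)  # bookkeeping for the demo only
            for line in fin:
                yield line


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("a.log", "1\n2\n3\n"), ("b.log", "4\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as out:
            out.write(text)
        paths.append(path)

    opened = []
    gen = iter_lines(paths, opened)
    first_two = list(itertools.islice(gen, 2))  # pull only two lines
    opened_so_far = [os.path.basename(p) for p in opened]
    gen.close()  # exits the pending with-block, closing a.log
```

After taking two lines, only `a.log` has been opened; `b.log` is never touched unless iteration continues.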


[1] As an aside, starting with Python 3.2, fileinput.FileInput is implemented as a context manager, which does exactly what we did before with contextlib.closing. Now our example becomes:

# Python 3.2+ version
with fileinput.input(filenames) as line_iter:
    for line in line_iter:
        ...

although the other examples will work on Python 3.2+ as well.

Eric O. Lebigot
mgilson
  • @Blender -- It's a decent module that doesn't get used too much since its functionality can be replaced by `chain.from_iterable`. `itertools` and `collections` are the better-known tools people reach for 90% of the time. I'm a little disappointed that it isn't implemented as a context manager though (it's not even a new-style class). It seems like it would be a pretty simple addition, but fortunately, it's easy enough to wrap with contextlib. – mgilson Apr 19 '13 at 02:51
  • As of Python 3.2, `fileinput` can be used as a context manager (http://docs.python.org/3/library/fileinput.html). – Ned Deily Apr 19 '13 at 02:53
  • @NedDeily -- I'm glad that they implemented it [the same way I did](http://stackoverflow.com/posts/16095960/revisions) after reading the source for only a few minutes (and before realizing it was an old-style class) – mgilson Apr 19 '13 at 02:55
  • I wasn't sure if context managers even worked with old style classes... After a very quick test, it appears that they do... but `contextlib.closing` is still nicer than a subclass just for that I think ... – mgilson Apr 19 '13 at 02:58
  • @mgilson How about editing your answer to include the current (3.2+) usage? *That's* the most pythonic. – Ned Deily Apr 19 '13 at 03:06
  • Have to go with the manual generator. Some of the files in the list require special helpers to open them, where fileinput.input is using the generic file open. Awesome advice, thanks! – Conrad.Dean Apr 20 '13 at 15:56
  • @Conrad.Dean -- Reasonable enough. It is worth pointing out that `fileinput` takes an optional argument `openhook` which gets called instead of `open`. So you would just need to delegate the opening of the files to an `openhook` function. – mgilson Apr 21 '13 at 01:26
  • Is there an option to make this work with symbolic links as well or do I need to get the real path first before I pass the list of files to fileinput? – tommy.carstensen Apr 28 '13 at 21:26
  • @tommy.carstensen -- As far as I know, this should still work if you give it a symbolic link... – mgilson Apr 28 '13 at 23:12
  • I must have made some mistake. I fed it a list of symlinks and got an error. – tommy.carstensen Apr 29 '13 at 23:20
  • yield does come in handy for such jobs. For those who want to master 'yield', please use this link: http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python – kzs Sep 09 '14 at 12:11
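
For the `openhook` route mentioned in the comments, a rough sketch on Python 3.3+ might look like the following. The hook name `open_log`, the `.gz` convention, and the demo files are assumptions for illustration. The only contract taken from the fileinput documentation is that the hook is called with `(filename, mode)` and returns an opened file-like object.

```python
import fileinput
import gzip
import os
import tempfile


def open_log(filename, mode):
    """Hypothetical openhook: fileinput calls it with (filename, mode)."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")  # decompress and decode as text
    return open(filename, mode)


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "auth.log")
    packed = os.path.join(tmp, "auth.log.1.gz")
    with open(plain, "w") as out:
        out.write("first\n")
    with gzip.open(packed, "wt") as out:
        out.write("second\n")

    # The with-statement form requires Python 3.2+.
    with fileinput.input([plain, packed], openhook=open_log) as line_iter:
        hooked_lines = [line for line in line_iter]
```

This keeps `fileinput`'s single-stream view of the files while delegating the special-case opening to one function.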