
I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:

for f in files:
    handle = open(f)
    process_file_contents(handle)

Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)

spazm
Ryan C. Thompson
  • That's not a duplicate. I know how to use `gzip.open`. I'm essentially asking if there's a function that looks at the file and automatically chooses `open`, `gzip.open`, or whatever other open function is appropriate for the compression being used, so I don't have to write a bunch of try/catch statements to try every possible open function myself. – Ryan C. Thompson Aug 21 '13 at 21:23
  • Something like [this](http://stackoverflow.com/questions/13044562/python-mechanism-to-identify-compressed-file-type-and-uncompress)? – Oli Aug 21 '13 at 21:56

1 Answer


I had the same problem: I wanted my code to accept a filename and return a file handle usable in a `with` statement, decompressing automatically when needed.

In my case, I'm willing to trust the filename extensions, and I only need to deal with gzip and maybe bzip2 files.

import bz2
import gzip

def open_by_suffix(filename):
    # Pick an opener from the extension; 'rt' yields decoded text lines (Python 3).
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rt')
    elif filename.endswith('.bz2'):
        return bz2.open(filename, 'rt')
    else:
        return open(filename, 'r')

If we don't trust the filenames, we can check the initial bytes of the file against known magic numbers (modified from https://stackoverflow.com/a/13044946/117714):

import bz2
import gzip

# Magic numbers are bytes, so the file has to be sniffed in binary mode.
magic_dict = {
    b"\x1f\x8b\x08": (gzip.open, 'rt'),
    b"\x42\x5a\x68": (bz2.open, 'rt'),
}
max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    # No known signature: treat the file as plain text.
    return open(filename, 'r')
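
The same idea extends to the other formats the question hints at with "etc.". Here is a minimal sketch, assuming Python 3 and adding xz through the standard `lzma` module. The name `open_any` and its `encoding` parameter are illustrative choices of mine, not part of the linked answer:

```python
import bz2
import gzip
import lzma

# Leading bytes that identify each format, mapped to an opener that
# supports text mode. The xz entry is an addition for illustration.
_OPENERS = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bzip2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}
_MAX_MAGIC = max(len(magic) for magic in _OPENERS)


def open_any(filename, encoding="utf-8"):
    """Return a text-mode handle, decompressing if a known signature is found."""
    # Sniff the signature in binary mode, then reopen with the right opener.
    with open(filename, "rb") as f:
        head = f.read(_MAX_MAGIC)
    for magic, opener in _OPENERS.items():
        if head.startswith(magic):
            return opener(filename, "rt", encoding=encoding)
    # No known signature: assume plain text.
    return open(filename, "r", encoding=encoding)
```

A file that is not compressed but happens to start with one of these byte sequences would be misdetected. That is unlikely for ordinary text, but worth keeping in mind.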

Usage:

# cat
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print(line, end='')  # print the line, not the file object

Your use-case would look like:

for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)
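
Since the question only needs the first few hundred lines of each file, `itertools.islice` can cap the read so the rest of the file is never consumed. A minimal sketch, where `head_lines` and its `n` parameter are hypothetical names and an in-memory `io.StringIO` stands in for whatever handle the opener returns:

```python
import io
from itertools import islice


def head_lines(handle, n=300):
    """Return an iterator over at most the first n lines of a file-like handle."""
    # islice stops pulling from the handle after n lines, so a compressed
    # stream is decompressed only roughly as far as needed (buffering aside).
    return islice(handle, n)


# Stand-in for a handle returned by an opener such as open_by_suffix.
demo = io.StringIO("".join(f"line {i}\n" for i in range(1000)))
first = list(head_lines(demo, 3))
print(first)
```

Because it works on any iterable handle, the same helper applies whether the underlying file was plain text, gzip, or bzip2.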
spazm