Python: Read data from files which are sometimes compressed and sometimes not

Question

I need to read binary data from some files that are normally compressed with gzip. I have managed to read the data by using the gzip module:

import gzip

def decode(self, filename):
    with gzip.open(filename, 'rb') as f:
        data = f.read()  # ReadData

However, sometimes the files are not compressed, in which case I get an IOError (because the file does not have the gzip header).

I could do something like:

try:
    f = gzip.open(filename, 'rb')
    f._read_gzip_header()
    f.rewind()
except IOError:
    f.close()
    f = open(filename, 'rb')

with f as gz:
    data = gz.read()  # ReadData

but I don't feel it is a good way to fix it.

I am looking for an elegant solution to this problem. I will write several "decode" functions for several file types. One option I am considering is to create a subclass of GzipFile to deal with it, but I believe there might be better ways.

I am using Python 2.7

Thank you in advance for any suggestion!

What's wrong with your solution? I wouldn't close the file and then immediately reopen it, but handling a potential `IOError` with try/except is exactly right. Just try to read it as gzip except `IOError` read as plain text except `IOError` traceback. Then do all your processing cleanly, with the processing code unaware of how the file was opened. — Two-Bit Alchemist, Mar 26 '14 at 22:05
Well, perhaps nothing is wrong with my solution (I am not so experienced in Python and sometimes I doubt what is good/wrong). The point why I did not like my solution was because I will write at least 6 decode functions (for 6 different file types) and I thought there might be a better way that I could use to avoid writing that try-except block in every function. But thanks for your comment. — user3466240, Mar 29 '14 at 07:12
If the issue is that you don't want to duplicate the code then just create a new contextmanager that yields a file-like object with uncompressed data whether input is compressed or not. Here's a [code example of a contextmanager `named_pipe()` that encapsulates creation of a named pipe](http://stackoverflow.com/a/22435492/4279) — jfs, Apr 01 '14 at 21:00

roippi · Answer 1 · 2014-03-27T14:40:09.427

You can check if the first two bytes are \x1f and \x8b, as per RFC 1952:

Member header and trailer

ID1 (IDentification 1) ID2 (IDentification 2) These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213), to identify the file as being in gzip format.

So for example,

with open('test.gz','rb') as f:
    print(f.read(2))

b'\x1f\x8b' #well, that was gzipped

with open('test','rb') as f:
    print(f.read(2))

b'he' #must not be gzipped

Presumably you would do some control flow based on those two bytes, then f.seek(0) and proceed accordingly.
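
As a minimal sketch of that control flow: the context manager below reads the two magic bytes, seeks back to the start, and yields either the raw file or a `GzipFile` wrapped around it, so the decode functions need not know how the file was opened. The names `open_maybe_gzip` and `GZIP_MAGIC` are hypothetical and not part of the gzip module.

```python
import gzip
from contextlib import contextmanager

# ID1 and ID2 from RFC 1952 (hypothetical constant name).
GZIP_MAGIC = b'\x1f\x8b'


@contextmanager
def open_maybe_gzip(filename):
    """Yield a binary file object with uncompressed data (hypothetical helper)."""
    with open(filename, 'rb') as raw:
        magic = raw.read(2)   # peek at the first two bytes
        raw.seek(0)           # rewind so the reader starts from the beginning
        if magic == GZIP_MAGIC:
            # Wrap the already-open file instead of reopening it.
            gz = gzip.GzipFile(fileobj=raw, mode='rb')
            try:
                yield gz
            finally:
                gz.close()    # does not close raw; the outer with does
        else:
            yield raw
```

Each decode function could then use `with open_maybe_gzip(filename) as f:` in place of `gzip.open(filename, 'rb')`. One caveat: an uncompressed file whose content happens to start with those two bytes would be misdetected as gzip.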

But honestly? Your solution is fine (modulo the unnecessary close/reopen part). try/except is pythonic.
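
If you stay with try/except, one way to drop the close/reopen is to open the file once, wrap it in a `GzipFile`, and fall back to the underlying file object when the header check fails. This is only a sketch under that assumption. The name `open_gzip_or_plain` is hypothetical, and it relies on public `read()`/`seek()` calls rather than `_read_gzip_header()`.

```python
import gzip
from contextlib import contextmanager


@contextmanager
def open_gzip_or_plain(filename):
    """Yield a readable binary stream, gzip-decoded when possible (hypothetical helper)."""
    with open(filename, 'rb') as raw:
        gz = gzip.GzipFile(fileobj=raw, mode='rb')
        try:
            gz.read(1)    # forces the gzip header to be parsed
            gz.seek(0)    # rewind the decompressed stream
            stream = gz
        except (IOError, OSError):
            raw.seek(0)   # not gzip: reuse the same file object from the start
            stream = raw
        try:
            yield stream
        finally:
            gz.close()    # releases the wrapper; raw is closed by the outer with
```

Wrapping this once means the try/except block lives in a single place instead of being repeated in every decode function.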

edited Mar 27 '14 at 14:40

answered Mar 26 '14 at 22:07


Thanks for your answer. (My comment to Two-Bit Alchemist applies also here). But I did not understand which part of my code was unnecessary: Don't I need to use close() if I don't use the "with" statement to open the file? – user3466240 Mar 29 '14 at 07:15
Yes, you should try to manually explicitly close file objects if not using the with statement as a context manager. However, writing `f.close` then `open(same_file)` back to back is as pointless as closing then re-opening a door you're about to exit through. :P – Two-Bit Alchemist Mar 29 '14 at 16:38
