39

I have a Python program which is going to take text files as input. However, some of these files may be gzip compressed.

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

Is the following reliable or could an ordinary text file 'accidentally' look gzip-like enough for me to get false positives?

try:
    gzip.GzipFile(filename, 'r')
    # compressed
    # ...
except:
    # not compressed
    # ...
Alex Riley
Ryan Gabbard
  • 2
    Just a small hint... never rely on the file ending. See hop's answer for how to do it. – helpermethod Sep 13 '10 at 18:44
  • @Helper: I'm not sure (see my edit). You'd still have to deal with a possible IOError, but gzipped files without the suffix are broken, in my opinion… tough call :) –  Sep 13 '10 at 18:51

6 Answers

45

The magic number for gzip-compressed files is 1f 8b. Testing for it is not 100% reliable, but it is highly unlikely that an "ordinary text file" starts with those two bytes; the sequence is not even valid UTF-8, because 0x8b is a continuation byte and cannot follow the single-byte character 0x1f.

Usually gzip compressed files sport the suffix .gz though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to in any case).

One problem with your approach is that gzip.GzipFile() will not raise an exception if you feed it an uncompressed file; only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
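To illustrate the point about the constructor, here is a minimal sketch that writes a plain text file, opens it with gzip.GzipFile, and only then tries to read. The temporary file and the two flag names are made up for the demonstration.

```python
import gzip
import os
import tempfile

# Create a plain (uncompressed) text file to experiment with.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('just some ordinary text\n')
    plain_path = tmp.name

opened_without_error = False
read_failed = False
try:
    fh = gzip.GzipFile(plain_path, 'r')  # the gzip header is not parsed here
    opened_without_error = True
    try:
        fh.read(1)                       # the header is parsed on the first read
    except OSError:                      # gzip.BadGzipFile (3.8+) is a subclass
        read_failed = True
    finally:
        fh.close()
finally:
    os.remove(plain_path)

print(opened_without_error, read_failed)
```

Both flags end up True on Python 3: the constructor succeeds and the failure only surfaces at read time, which is why the try/except from the question cannot tell the two kinds of file apart.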

  • gzip compressed files often have the .gz file extension (in fact, I don't think I've ever seen a .gzip extension), but it's generally unsafe to rely on file extension to test for the type of file anyhow. – CanSpice Sep 13 '10 at 18:51
  • Does it? - The gzip C library will transparently read uncompressed files. Although it will write files uncompressed it puts CRC codes through them to allow "gzip -t" (caught me out once) – Martin Beckett Sep 13 '10 at 18:53
  • @Martin: it does: $ gunzip foo gzip: foo: unknown suffix -- ignored –  Sep 13 '10 at 19:03
  • The c 'library' gzip, ie gzopen/gzread/etc will transparently read uncompressed files. They have an open compression=none mode which does NOT write unchanged flat files. – Martin Beckett Sep 13 '10 at 20:15
  • About extensions. You would also have to check for the relatively common `.tgz` extension. – mxmlnkn Mar 09 '19 at 18:20
43

Is there a cross-platform, usable from Python way to determine if a file is gzip compressed or not?

The accepted answer explains how one can detect a gzip compressed file in general: test if the first two bytes are 1f 8b. However it does not show how to implement it in Python.

Here is one way:

def is_gz_file(filepath):
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == b'\x1f\x8b'
  • 4
    Can be done without binascii as well: `test_f.read(2) == b'\x1f\x8b'` – nemetroid Oct 29 '20 at 11:08
  • 3
    For a lower false positive rate, you can test that the first _three_ bytes are `1f 8b 08`. – Mark Adler Jul 13 '21 at 18:14
  • 1
    If the file is not a `.gz` file, doesn't `test_f.read(2)` throw `OSError` in the first place? Is there still a need to check the bytes using `test_f.read(2) == b'\x1f\x8b'`? EDIT: it seems that this is only available since python 3.7. – Blade Nov 30 '21 at 17:43
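Following the three-byte suggestion in the comments, a possible sketch of a stricter check, plus a helper that opens a file as text whether or not it is compressed. The names GZIP_MAGIC, looks_like_gzip and open_text_maybe_gzip are hypothetical, and the UTF-8 default is an assumption about the input files.

```python
import gzip

# First three bytes of a gzip member: ID1, ID2, and CM = 8 (deflate).
GZIP_MAGIC = b'\x1f\x8b\x08'


def looks_like_gzip(filepath):
    """Return True if the file starts with the three-byte gzip signature."""
    with open(filepath, 'rb') as f:
        return f.read(3) == GZIP_MAGIC


def open_text_maybe_gzip(filepath, encoding='utf-8'):
    """Open filepath for reading text, decompressing if it looks gzipped."""
    if looks_like_gzip(filepath):
        return gzip.open(filepath, 'rt', encoding=encoding)
    return open(filepath, 'r', encoding=encoding)
```

With this, the rest of the program can read lines from the returned file object without caring which kind of file it was given. A corrupt gzip file that happens to have a valid signature would still raise on read, so the caller has to handle OSError in any case.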
15

Testing the magic number of a gzip file is the only reliable way to go. However, as of Python 3.7 there is no need to compare the bytes yourself anymore: the gzip module checks them for you and raises an exception on read if they do not match.

As of Python 3.7, this works:

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except OSError:
        print('input_file is not a valid gzip file by OSError')

As of Python 3.8, this also works (gzip.BadGzipFile is a subclass of OSError):

import gzip
with gzip.open(input_file, 'r') as fh:
    try:
        fh.read(1)
    except gzip.BadGzipFile:
        print('input_file is not a valid gzip file by BadGzipFile')
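The same idea can be wrapped in a small predicate. This is only a sketch: the name is_readable_gzip is made up, and catching EOFError as well is an assumption meant to cover files that are cut off partway through the header.

```python
import gzip


def is_readable_gzip(path):
    """Return True if gzip can read at least the start of the file."""
    try:
        with gzip.open(path, 'rb') as fh:
            fh.read(1)  # forces the header (magic number) to be parsed
    except (OSError, EOFError):
        # OSError also covers a missing or unreadable file, so False here
        # means "could not be read as gzip", not strictly "exists but plain".
        return False
    return True
```

Because gzip.BadGzipFile inherits from OSError, the single except clause behaves the same on versions before and after 3.8.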
winni2k
2

The gzip module itself will raise an OSError when you read from a file that is not gzipped.

>>> with gzip.open('README.md', 'rb') as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/Users/dennis/.asdf/installs/python/3.6.6/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'# ')

You can combine this approach with others to increase confidence, such as checking the MIME type, looking for the magic number in the file header (see the other answers for an example), and checking the extension.

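import gzip
import pathlib

if '.gz' in pathlib.Path(filepath).suffixes:
    # some more inexpensive checks until confident we can attempt to decompress
    # ...
    try:
        with gzip.open(filepath, 'rb') as f:
            f.read(1)  # reading forces the gzip header check
    except OSError as e:
        ...  # not a valid gzip file after all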
Dennis
  • 1
    Python 3.8 adds a more specific error called `gzip.BadGzipFile` for this purpose. This error still inherits from `OSError`. – winni2k Mar 11 '20 at 10:25
0

Import the mimetypes module. It can guess what kind of file you have, and whether it is compressed, from the filename.

For example:

mimetypes.guess_type('blabla.txt.gz')

returns:

('text/plain', 'gzip')

David Ries
  • 24
    `mimetypes` only checks the end of the filename, it doesn't actually guess based on the content of the file. – Odinulf Aug 20 '13 at 19:44
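A quick way to see the point made in the comment: guess_type gives an answer for names that do not exist on disk, and gives nothing for a name without a suffix. The file names below are made up for the demonstration.

```python
import mimetypes

# The name from the answer: the encoding is inferred from the ".gz" suffix.
named = mimetypes.guess_type('blabla.txt.gz')

# A file that almost certainly does not exist gets the same kind of answer,
# because the file's contents are never opened.
missing = mimetypes.guess_type('no_such_dir/no_such_file.txt.gz')

# Without a suffix there is nothing to go on, whatever the file contains.
no_suffix = mimetypes.guess_type('datasets/test')

print(named, missing, no_suffix)
```

So mimetypes is only useful here when the file names can be trusted; a gzipped file without the .gz suffix goes undetected.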
0

This doesn't seem to work well in Python 3...

import mimetypes
filename = "./datasets/test"

def file_type(filename):
    # guess_type() returns a (type, encoding) tuple based on the filename alone
    return mimetypes.guess_type(filename)
print(file_type(filename))

This returns (None, None), but the Unix command "file" reports:

:~> file datasets/test
datasets/test: gzip compressed data, was "iostat_collection", from Unix, last modified: Thu Jan 29 07:09:34 2015

ewr2san
  • 3
    mimetypes uses just the filename to guess the type. To detect a filetype from the raw file you will need to use the 'magic' module. – Brice M. Dempsey Apr 19 '16 at 08:30