1

Reading in a batch of syslog files. Some of them have been compressed:

syslog
syslog.1
syslog.2.gz
syslog.3.gz
syslog.4.gz

I currently open each file with:

with open(args.filename_path, "rb") as file:

Everything seems to work OK when I use gzip.open in the with statement for files with a .gz extension:

with gzip.open(args.filename_path, "rb") as file:

How can I use gzip.open when the file is a .gz file, but plain open when it is a "normal" file?

gm3dmo
  • Do you iterate and check the file endings to use either `with open(file)` or `with gz.open(file)`? – m.i.cosacak May 03 '21 at 15:42
  • I'm confused. You say everything works using gzip.open, then ask how to get gzip.open to work with a gz file. Is the question how to know when to use open versus gzip.open? – tdelaney May 03 '21 at 15:46

3 Answers

4

One way to do this is to check whether the file is gzipped: if it is, use gzip.open; otherwise, use open.

import gzip

paths = ["foo.log", "bar.log.gz"]

def is_gzipped(path):
    return path.endswith(".gz")

for path in paths:
    open_fn = gzip.open if is_gzipped(path) else open
    with open_fn(path, "rb") as f:  # binary mode, so both openers return bytes
        pass  # do things with file

A more robust way to check whether a file is gzipped is to read its "magic number" (the first two bytes) and see whether it matches that of gzip. See https://stackoverflow.com/a/47080739/5666087 for more information.

def is_gzipped(path):
    with open(path, "rb") as f:
        return f.read(2) == b'\x1f\x8b'
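
The check and the choice of opener can be folded into one helper. The following is only a minimal sketch under a few assumptions: the name `open_maybe_gzip` is hypothetical (it is not a standard library function), and binary mode is used by default to match the `"rb"` in the question.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_maybe_gzip(path, mode="rb"):
    """Hypothetical helper: choose gzip.open or open from the file contents."""
    open_fn = gzip.open if is_gzipped(path) else open
    return open_fn(path, mode)
```

It can then be used in place of either call, for example `with open_maybe_gzip(args.filename_path) as file:`. A file that merely happens to start with those two bytes would still be misdetected, so an exception handler around the read remains a sensible precaution.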
jkr
  • It may also be better to have `is_gzipped` attempt `gzip.open` on the file and catch the exception. Then you are not extension dependent. – tdelaney May 03 '21 at 15:58
  • @tdelaney - good idea. what do you think about checking the magic numbers? – jkr May 03 '21 at 16:05
  • That's worked nicely thank you @tdelaney. Apologies, reading it back I wasn't absolutely clear in my original description, my problem was to make `with` happen with `gzip.open` when the file is compressed. – gm3dmo May 03 '21 at 16:11
  • I like the "magic" number suggestion also and will work that into my script too. – gm3dmo May 03 '21 at 16:14
  • @jakob I like the magic number test especially if you want to expand the number of compressors supported. Magic number isn't fool proof. You'll still want to wrap the gzip open in an exception handler just in case. And since you are doing that anyway, why bother with the magic number. – tdelaney May 04 '21 at 00:17
  • @tdelaney - interestingly, i was exploring the gzip module and it raises an exception if the magic numbers are not correct. https://github.com/python/cpython/blob/00726e51ade10c7e3535811eb700418725244230/Lib/gzip.py#L434-L435 – jkr May 04 '21 at 12:56
2

The hook_compressed() function of the standard library module fileinput does exactly what you are asking for:

Transparently opens files compressed with gzip and bzip2 (recognized by the extensions .gz and .bz2) using the gzip and bz2 modules. If the filename extension is not .gz or .bz2, the file is opened normally (ie, using open() without any decompression).

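import fileinput

# hook_compressed() requires a mode argument; "rb" returns bytes in both cases
with fileinput.hook_compressed("test.txt.gz", "rb") as f:
    f.read()
with fileinput.hook_compressed("test.txt", "rb") as f:
    f.read()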
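
For a whole batch of files, such as the syslog rotation in the question, the hook can also be passed to `fileinput.input()` through its `openhook` parameter. This is a rough sketch rather than a definitive recipe; the function name `read_lines` is an assumption made for illustration, and `mode="rb"` is chosen so that plain and compressed files both yield bytes.

```python
import fileinput


def read_lines(paths):
    """Yield raw lines (bytes) from plain, .gz and .bz2 files, in order."""
    # Note: an empty `paths` makes fileinput fall back to standard input.
    with fileinput.input(files=paths, mode="rb",
                         openhook=fileinput.hook_compressed) as lines:
        for line in lines:
            yield line
```

Keep in mind that hook_compressed() decides by file extension only, so a gzip file without a .gz suffix would be read as raw bytes.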
Alexander Solovets
1

gzip.open should work on any gzip-compressed file, whatever its extension. Here is an example:

In the OS:

cat test.txt
foo
bar

gzip test.txt

Even if the extension is not gz:

In the OS:

mv test.txt.gz test.txt

In Python:

with gzip.open("test.txt") as f:
    print(f.read())

foo
bar

In Python:

with gzip.open("test.txt.gz") as f:
    print(f.read())

foo
bar

However, if the file is not gzip-compressed, you will get an exception when you read from it: IOError: Not a gzipped file.
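
That exception can be used as the detection mechanism itself: try gzip first and fall back to a plain read. The sketch below assumes Python 3, where IOError is an alias of OSError and the error is raised on the first read rather than by gzip.open itself. The helper name `read_bytes` is hypothetical.

```python
import gzip


def read_bytes(path):
    """Hypothetical helper: return file contents, decompressing if gzipped."""
    try:
        with gzip.open(path, "rb") as f:
            return f.read()
    except OSError:
        # Raised on the first read when the data is not gzip
        # (gzip.BadGzipFile on Python 3.8+, a subclass of OSError).
        with open(path, "rb") as f:
            return f.read()
```

Because OSError is broad, a missing or unreadable file also lands in the fallback branch, where the plain open raises the same error again, so such problems are not silently hidden.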

quantummind