
I'm in my 2nd week of Python and I'm stuck on a directory of zipped/unzipped logfiles, which I need to parse and process.

Currently I'm doing this:

import os
import sys
import glob
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f,'rb')        
    new_file_name = f + "_unzipped"
    last_pos = file.tell()

    # test for gzip
    if (file.read(2) == b'\x1f\x8b'):
        file.seek(last_pos)

    #unzip to new file
    out = open( new_file_name, "wb" )
    process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)

    while True:
      if process.poll() != None:
        break;

    output = io_method(process.communicate()[0])
    exitCode = process.returncode


    if (exitCode == 0):
      print "done"
      out.write( output )
      out.close()
    else:
      raise ProcessException(command, exitCode, output)

which I've "stitched" together from these SO answers (here) and blog posts (here).

However, it does not seem to work: my test file is 2.5 GB and the script has been chewing on it for 10+ minutes, and I'm not really sure that what I'm doing is correct anyway.

Question:
If I don't want to use the gzip module and need to decompress chunk-by-chunk (the actual files are >10 GB), how do I uncompress and save to a file using zcat and subprocess in Python?
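
Roughly, I'm imagining something along these lines (an untested sketch; it assumes `zcat` is on the PATH and the 1 MB chunk size is an arbitrary pick):

import subprocess

def unzip_to_file(src, dst, chunk_size=1024 * 1024):
    # Stream zcat's stdout to the destination file in chunks, so the
    # decompressed data never has to fit in memory all at once.
    proc = subprocess.Popen(["zcat", src], stdout=subprocess.PIPE)
    with open(dst, "wb") as out:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    proc.wait()
    return proc.returncode

Is that the right general direction?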

Thanks!

frequent
  • I am unclear on what your goal is. Are you trying to uncompress all of the files in a directory, equivalent to `gunzip *.gz`? Do you have any specific objection to using the gzip module? – Robᵩ Mar 11 '13 at 14:29
  • The directory contains zipped and unzipped files. I need to process both in a single process, so my idea was to (1) first run over the directory, (2) pick the zipped files and unzip them to new files, (3) then do a 2nd run to process. Not sure if this is the best way to go, though – frequent Mar 11 '13 at 14:31
  • re: the objection to `gzip`: isn't it that gzip is very slow, like mentioned [here](http://codebright.wordpress.com/2011/03/25/139/)? – frequent Mar 11 '13 at 14:32
  • Do you need to seek on the logfiles, or is reading them in one pass sufficient? – Robᵩ Mar 11 '13 at 14:40
  • I need to retrieve the first log entry (line) per file (zipped/unzipped), extract the date, and store it with the filepath. The next pass will process the log files line-by-line in sorted order. – frequent Mar 11 '13 at 14:44
  • You probably need a buffer size for your `Popen` call. – hughdbrown Mar 11 '13 at 14:52
  • Okay. I'm not sure what you are trying to accomplish with `_unzipped`, `.poll` or `io_method`. Perhaps there is something subtle that I've missed. I posted the obvious answer below. – Robᵩ Mar 11 '13 at 14:53

1 Answer


This should read the first line of every file in the logs subdirectory, unzipping as required:

#!/usr/bin/env python

import glob
import gzip
import subprocess

for f in glob.glob('logs/*'):
  if f.endswith('.gz'):
    # Open a compressed file. Here is the easy way:
    #   file = gzip.open(f, 'rb')
    # Or, here is the hard way:
    proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
    file = proc.stdout
  else:
    # Otherwise, it must be a regular file
    file = open(f, 'rb')

  # Process file, for example:
  print f, file.readline()
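
And to sketch the two-pass, date-sorted processing mentioned in the comments (parse_date below is a hypothetical placeholder for whatever matches the actual log format):

import glob
import gzip

def open_log(path):
  # Both gzip.open and open return file-like objects we can read line-by-line.
  if path.endswith('.gz'):
    return gzip.open(path, 'rb')
  return open(path, 'rb')

def parse_date(line):
  # Hypothetical: assumes the timestamp is the first whitespace-separated field.
  return line.split()[0]

# First pass: record (date, path) for every log file.
entries = []
for path in glob.glob('logs/*'):
  f = open_log(path)
  entries.append((parse_date(f.readline()), path))
  f.close()

# Second pass: process each file line-by-line, in date order.
for date, path in sorted(entries):
  f = open_log(path)
  for line in f:
    pass  # actual processing goes here
  f.close()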
Robᵩ