
I'm in my 2nd week of Python and I'm stuck on a directory of zipped/unzipped logfiles, which I need to parse and process.

Currently I'm doing this:

import os
import sys
import glob
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f,'rb')        
    new_file_name = f + "_unzipped"
    last_pos = file.tell()

    # test for gzip
    if (file.read(2) == b'\x1f\x8b'):
        file.seek(last_pos)

    #unzip to new file
    out = open( new_file_name, "wb" )
    process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)

    while True:
      if process.poll() != None:
        break;

    output = io_method(process.communicate()[0])
    exitCode = process.returncode


    if (exitCode == 0):
      print "done"
      out.write( output )
      out.close()
    else:
      raise ProcessException(command, exitCode, output)

which I've "stitched" together from these SO answers (here) and blog posts (here).

However, it does not seem to work: my test file is 2.5 GB and the script has been chewing on it for 10+ minutes, and I'm not really sure that what I'm doing is correct anyway.

Question:
If I don't want to use the gzip module and need to decompress chunk-by-chunk (the actual files are >10 GB), how do I uncompress and save to a file using zcat and subprocess in Python?
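
Roughly, I'm imagining something along these lines (an untested sketch; it assumes `zcat` is on the PATH and the 1 MB chunk size is an arbitrary pick):

import subprocess

def unzip_to_file(src, dst, chunk_size=1024 * 1024):
    # Stream zcat's stdout to the destination file in chunks, so the
    # decompressed data never has to fit in memory all at once.
    proc = subprocess.Popen(["zcat", src], stdout=subprocess.PIPE)
    with open(dst, "wb") as out:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    proc.wait()
    return proc.returncode

Is that the right general direction?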

Thanks!

frequent
  • I am unclear on what your goal is. Are you trying to uncompress all of the files in a directory, equivalent to `gunzip *.gz`? Do you have any specific objection to using the gzip module? – Robᵩ Mar 11 '13 at 14:29
  • The directory contains zipped and unzipped files. I need to process both in a single process, so my idea was to (1) first run over the directory, (2) pick the zipped files and unzip them to new files, (3) then do a 2nd run to process. Not sure if this is the best way to go, though – frequent Mar 11 '13 at 14:31
  • re: the objection to `gzip`: isn't it that gzip is very slow, like mentioned [here](http://codebright.wordpress.com/2011/03/25/139/)? – frequent Mar 11 '13 at 14:32
  • Do you need to seek on the logfiles, or is reading them in one pass sufficient? – Robᵩ Mar 11 '13 at 14:40
  • I need to retrieve the first log entry (line) per file (zipped/unzipped), extract the date, and store it with the filepath. The next pass will process the log files line-by-line in sorted order. – frequent Mar 11 '13 at 14:44
  • You probably need a buffer size for your `Popen` call. – hughdbrown Mar 11 '13 at 14:52
  • Okay. I'm not sure what you are trying to accomplish with `_unzipped`, `.poll` or `io_method`. Perhaps there is something subtle that I've missed. I posted the obvious answer below. – Robᵩ Mar 11 '13 at 14:53

1 Answer


This should read the first line of every file in the logs subdirectory, unzipping as required:

#!/usr/bin/env python

import glob
import gzip
import subprocess

for f in glob.glob('logs/*'):
  if f.endswith('.gz'):
    # Open a compressed file. Here is the easy way:
    #   file = gzip.open(f, 'rb')
    # Or, here is the hard way:
    proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
    file = proc.stdout
  else:
    # Otherwise, it must be a regular file
    file = open(f, 'rb')

  # Process file, for example:
  print f, file.readline()
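
And to sketch the two-pass, date-sorted processing mentioned in the comments (parse_date below is a hypothetical placeholder for whatever matches the actual log format):

import glob
import gzip

def open_log(path):
  # Both gzip.open and open return file-like objects we can read line-by-line.
  if path.endswith('.gz'):
    return gzip.open(path, 'rb')
  return open(path, 'rb')

def parse_date(line):
  # Hypothetical: assumes the timestamp is the first whitespace-separated field.
  return line.split()[0]

# First pass: record (date, path) for every log file.
entries = []
for path in glob.glob('logs/*'):
  f = open_log(path)
  entries.append((parse_date(f.readline()), path))
  f.close()

# Second pass: process each file line-by-line, in date order.
for date, path in sorted(entries):
  f = open_log(path)
  for line in f:
    pass  # actual processing goes here
  f.close()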
Robᵩ