5

I am trying to stream data through a subprocess, gzip it, and write it to a file. The following works, but I wonder whether it is possible to use Python's native gzip library instead.

fid = gzip.open(self.ipFile, 'rb') # input data
oFid = open(filtSortFile, 'wb') # output file
sort = subprocess.Popen(args="sort | gzip -c ", shell=True, stdin=subprocess.PIPE, stdout=oFid) # set up the pipe
processlines(fid, sort.stdin, filtFid) # pump data into the pipe

THE QUESTION: How do I do this with Python's gzip package instead? I'm mostly curious to know why the following gives me a plain text file (instead of a compressed binary one) ... very odd.

fid = gzip.open(self.ipFile, 'rb')
oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=oFid)
processlines(fid, sort.stdin, filtFid)
Brent Worden
fodon

3 Answers

6

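subprocess writes directly to the file descriptor returned by oFid.fileno(), but a gzip file object returns the descriptor of its underlying (uncompressed) file object, so the output bypasses compression entirely: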

def fileno(self):
    """Invoke the underlying file object's fileno() method."""
    return self.fileobj.fileno()
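A minimal sketch of the effect, assuming Python 3. The temporary path and the child command are made up for the demonstration; only gzip.open, fileno() and the stdout= argument come from the question.

```python
import gzip
import os
import subprocess
import sys
import tempfile

# Hypothetical scratch location for the demo file.
path = os.path.join(tempfile.mkdtemp(), "demo.gz")

with gzip.open(path, "wb") as gz:
    # The gzip object reports the descriptor of the plain file it wraps.
    same_fd = gz.fileno() == gz.fileobj.fileno()
    # The child process writes straight to that descriptor,
    # so its output never passes through the compressor.
    subprocess.check_call([sys.executable, "-c", "print('hello')"], stdout=gz)

with open(path, "rb") as raw:
    data = raw.read()

print(same_fd)           # the two descriptors are the same
print(b"hello" in data)  # the child's text sits in the file uncompressed
```

The child's text ends up verbatim in the file, which is why the output in the question looks like plain text.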

To enable compression, read the subprocess output through a pipe and write it with the gzip methods directly:

import gzip
from subprocess import Popen, PIPE
from threading import Thread

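def f(input, output):
    # Copy sorted lines from the pipe into the gzip file, then close both
    # so the gzip trailer is written. Pipes are binary, hence the b'' sentinel.
    for line in iter(input.readline, b''):
        output.write(line)
    input.close()
    output.close()

p = Popen(["sort"], bufsize=-1, stdin=PIPE, stdout=PIPE)
t = Thread(target=f, args=(p.stdout, gzip.open('out.gz', 'wb')))
t.start()

for s in "cafebabe":
    p.stdin.write((s + "\n").encode())
p.stdin.close()
t.join()  # wait until everything has been compressed
p.wait()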

Example

$ python gzip_subprocess.py  && od -c out.gz && zcat out.gz 
0000000 037 213  \b  \b 251   E   t   N 002 377   o   u   t  \0   K 344
0000020   J 344   J 002 302   d 256   T       L 343 002  \0   j 017   j
0000040   k 020  \0  \0  \0
0000045
a
a
b
b
c
e
e
f
Mechanical snail
jfs
  • I liked the elegance of this solution. However, when I test with a file of 0.8M lines (3.5M compressed), this method takes about 35s longer than the old method. In fact, the time the pipe takes up to the input of the gzipping thread is the same as the first method takes to run to completion. Seems a bit strange for a piping solution? – fodon Sep 17 '11 at 19:33
  • @fodon: assign `bufsize` to use buffering – jfs Sep 17 '11 at 19:46
  • can you give an example, I'm not sure how it fits in. – fodon Sep 17 '11 at 21:56
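One possible reading of the buffering suggestion, sketched under assumptions: rather than copying line by line, copy the sorted output in large blocks with shutil.copyfileobj. The helper name compress_stream, the 1 MiB chunk size and the temporary output path are hypothetical choices for illustration, and a POSIX sort is assumed to be on the PATH. Whether this recovers the 35s has not been measured here.

```python
import gzip
import os
import shutil
import tempfile
from subprocess import Popen, PIPE
from threading import Thread

def compress_stream(src, dst_path, chunk_size=1 << 20):
    # Copy in large blocks instead of one line at a time.
    with gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    src.close()

out_path = os.path.join(tempfile.mkdtemp(), "out.gz")  # hypothetical path

# LC_ALL=C keeps the sort order byte-wise and predictable.
p = Popen(["sort"], stdin=PIPE, stdout=PIPE, env=dict(os.environ, LC_ALL="C"))
t = Thread(target=compress_stream, args=(p.stdout, out_path))
t.start()

for s in "cafebabe":
    p.stdin.write((s + "\n").encode())
p.stdin.close()
t.join()
p.wait()

with gzip.open(out_path, "rb") as check:
    result = check.read()
```

The compression level also affects speed; gzip.open accepts a compresslevel argument if a faster, larger output is acceptable.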
2

Since you only pass the file handle to the process you're executing, none of the file object's other methods are involved, so nothing gets compressed. To work around this, have the process write its output to a pipe and read from that, like so:

oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
oFid.writelines(sort.stdout)
oFid.close()
steabert
  • What if the stream being gzipped is in the GB range? There needs to be a process that generates the data, and it is called before the writelines method. Where will that data reside after being generated and before writelines is called? – fodon Sep 17 '11 at 13:24
  • euhm? I don't really understand what you're trying to say. – steabert Sep 17 '11 at 14:51
  • Your code should work. There is a nuance though. I added some code and comments to the original problem. Should your writelines() method be called after processlines()? If so, where would this data sit once the lines have been generated in processlines() and before writelines() is called? One advantage of piping is that it eliminates a read/write cycle ... which becomes significant in my case, with data on the order of 10GB. It is not clear to me that this advantage is maintained in this code ... what do you think? – fodon Sep 17 '11 at 18:32
  • I tried to answer your problem which was related to not understanding why you got just text instead of a gzipped file, the point being that you shouldn't just pass the gzip fd to the subprocess. As for your extra questions: the other answer already provided the solutions for that (my example doesn't take into account the input pipe). – steabert Sep 17 '11 at 18:51
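A rough sketch of how the input side could be combined with writelines(), assuming Python 3 and a POSIX sort on the PATH. The feed helper, the sample lines and the temporary path are hypothetical stand-ins for processlines() and the real data. The feeder runs in its own thread, so the generated lines go straight into the pipe and are not collected in Python first. Note that sort itself cannot emit anything until it has read all of its input, so the buffering happens inside sort (GNU sort spills to temporary files for large inputs).

```python
import gzip
import os
import subprocess
import tempfile
import threading

def feed(lines, pipe):
    # Stand-in for processlines(): push data into sort's stdin, then
    # close it so sort knows the input is complete.
    for line in lines:
        pipe.write(line)
    pipe.close()

out_path = os.path.join(tempfile.mkdtemp(), "sorted.gz")  # hypothetical path
lines = [b"pear\n", b"apple\n", b"fig\n"]                 # made-up sample data

sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        env=dict(os.environ, LC_ALL="C"))
feeder = threading.Thread(target=feed, args=(lines, sort.stdin))
feeder.start()

# Meanwhile the main thread compresses whatever sort emits.
with gzip.open(out_path, "wb") as oFid:
    oFid.writelines(sort.stdout)

feeder.join()
sort.stdout.close()
sort.wait()

with gzip.open(out_path, "rb") as check:
    sorted_lines = check.readlines()
```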
0

Yes, it is possible to use Python's native gzip library instead. I recommend looking at this question: gzip a file in Python.

I'm now using Jace Browning's answer:

with open('path/to/file', 'rb') as src, gzip.open('path/to/file.gz', 'wb') as dst:
    dst.writelines(src)

Although one comment there claims that you have to convert the src content to bytes, that is not required with this code, because src is opened in binary mode and already yields bytes.
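For completeness, a self-contained round trip of the same pattern, assuming Python 3. The file names and contents are made up for the demonstration.

```python
import gzip
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "file.txt")  # hypothetical input file
dst_path = src_path + ".gz"

with open(src_path, "wb") as f:
    f.write(b"line one\nline two\n")

# Iterating a file opened in 'rb' mode yields bytes, which is what
# the gzip file object expects.
with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
    dst.writelines(src)

# Read the archive back to confirm the contents survived.
with gzip.open(dst_path, "rb") as check:
    restored = check.read()

with open(dst_path, "rb") as raw:
    magic = raw.read(2)  # gzip files start with the bytes 1f 8b
```

Note that this compresses an existing file; it does not by itself address the subprocess pipe from the question.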