
There have been similar questions asked (and answered), but never really all together, and I can't seem to get anything to work. Since I am just starting with Python, something easy to understand would be great!

I have 3 large data files (>500G) that I need to unzip, concatenate, pipe to a subprocess, and then pipe that output to another subprocess. I then need to process the final output, which I would like to do in Python. Note that I do not need the unzipped and/or concatenated file except for the processing; creating one would, I think, be a waste of space. Here is what I have so far...

import gzip
from subprocess import Popen, PIPE

#zipped files
zipfile1 = "./file_1.txt.gz"   
zipfile2 = "./file_2.txt.gz"  
zipfile3 = "./file_3.txt.gz"


# Open the first pipe
p1 = Popen(["dataclean.pl"], stdin=PIPE, stdout=PIPE)

# Unzip the files and pipe them in (has to be a more pythonic way to do it - 
# if this is even correct)
unzipfile1 = gzip.open(zipfile1, 'wb')
p1.stdin.write(unzipfile1.read())
unzipfile1.close()

unzipfile2 = gzip.open(zipfile2, 'wb')
p1.stdin.write(unzipfile2.read())
unzipfile2.close()

unzipfile3 = gzip.open(zipfile3, 'wb')
p1.stdin.write(unzipfile3.read())
unzipfile3.close()


# Pipe the output of p1 to p2
p2 = Popen(["dataprocess.pl"], stdin=p1.stdout, stdout=PIPE)

# Not sure what this does - something about a SIGPIPE
p1.stdout.close()

## Not sure what this does either - but it is in the pydoc
output = p2.communicate()[0]

## more processing of p2.stdout...
print p2.stdout

Any suggestions would be greatly appreciated. As a bonus question, the pydoc for read() says this:

"Also note that when in non-blocking mode, less data than what was requested may be returned, even if no size parameter was given."

That seems scary. Can anyone interpret it? I don't want to read in only part of a dataset thinking it is the whole thing. I thought leaving out the size argument was a good thing, especially when I don't know the size of the file.

Thanks,

GK

user1294223
  • Are you sure that you want to use Python to deal with over a terabyte of data? Unzipping, concatenating, and piping are right up the alley of a shell script or batch file. – Adam Mihalcin Mar 27 '12 at 01:52
  • I would try to avoid loading that much data at a time. What exactly are you trying to do with the data? You could probably accomplish this with a series of generators. – Joel Cornett Mar 27 '12 at 01:55
  • Currently it is done with a bash script that calls a perl script for some data cleaning and then a C++ program for some analysis (very large fMRI files). I was trying to add some more functionality to the original bash script, but it was getting a bit long and tedious. I figured I'd give Python a go. Sounds like it is a bad idea? – user1294223 Mar 27 '12 at 02:10
  • Don't worry about non-blocking mode -- you would know if you were using it, as it requires a very different style of programming. – sarnold Mar 27 '12 at 02:22

1 Answer


First things first; I think you've got your modes incorrect:

unzipfile1 = gzip.open(zipfile1, 'wb')

That opens zipfile1 for writing, not reading. I hope your data still exists.

Second, you do not want to try to work with the entire data all at once. You should work with the data in blocks of 16k or 32k or something. (The optimum size will vary based on many factors; make it configurable if this task has to be done many times, so you can time different sizes.)

What you're looking for is probably more like this untested sketch:

# copy the decompressed data to p1 in 16k blocks
for block in iter(lambda: unzipfile1.read(4096 * 4), b''):
    p1.stdin.write(block)

If you're trying to hook together multiple processes in a pipeline in Python, then it'll probably look more like this:

for block in iter(lambda: unzipfile1.read(4096 * 4), b''):
    p1.stdin.write(block)
    p2.stdin.write(p1.stdout.read(4096 * 4))  # pass p1's output along in bounded chunks

This hands the output from p1 to p2 as quickly as possible. I've made the assumption that p1 won't generate significantly more output than the input it was given. If p1's output will be ten times greater than its input, then you should make another loop similar to this one to keep draining it.
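
A minimal sketch of that extra loop (untested, and assuming you close p1.stdin once all the input has been written, so that p1 knows it can finish) might be:

# once everything has been fed in, close p1's stdin so it sees EOF
# and can finish flushing its remaining output
p1.stdin.close()

# forward whatever p1 still produces to p2, block by block
for block in iter(lambda: p1.stdout.read(4096 * 4), b''):
    p2.stdin.write(block)

# likewise signal EOF to p2
p2.stdin.close()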


But, I've got to say, this feels like a lot of extra work to replicate the shell script:

gzip -cd file_1.txt.gz file_2.txt.gz file_3.txt.gz | dataclean.pl | dataprocess.pl

gzip(1) will automatically handle the block-sized data transfer as I've described above, and assuming your dataclean.pl and dataprocess.pl scripts also work with data in blocks rather than performing full reads (as your original version of this script does), then they should all run in parallel at close to the best of their abilities.
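
If you do want to keep the orchestration in Python (for example, to pick which files go in), here is a minimal untested sketch of driving that same pipeline with subprocess, assuming gzip, dataclean.pl, and dataprocess.pl are all executable and on your PATH:

from subprocess import Popen, PIPE

zipped = ["./file_1.txt.gz", "./file_2.txt.gz", "./file_3.txt.gz"]

# let gzip(1) handle the decompression and concatenation in blocks
gz = Popen(["gzip", "-cd"] + zipped, stdout=PIPE)

# chain the two processing scripts off gzip's output
p1 = Popen(["dataclean.pl"], stdin=gz.stdout, stdout=PIPE)
p2 = Popen(["dataprocess.pl"], stdin=p1.stdout, stdout=PIPE)

# close our copies of the intermediate pipes so SIGPIPE can propagate
# if a downstream process exits early
gz.stdout.close()
p1.stdout.close()

# wait for the pipeline to finish and collect p2's output for further work
output = p2.communicate()[0]

That keeps gzip responsible for the block-sized transfers and lets all three processes run in parallel; if p2's own output is huge, read p2.stdout in blocks instead of using communicate().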

sarnold
  • Most of the stuff I wanted Python to do was to select the particular files I wanted (depending on day of the week, file availability, etc.). I also wanted to call different processing scripts, depending on what I wanted to do and, again, the availability of files. I guess I can do all that with Python, create a string, then call os.system(). Would that be a better idea? – user1294223 Mar 27 '12 at 02:38
  • If by `os.system()` [you meant `subprocess.call()`](http://stackoverflow.com/questions/204017/how-do-i-execute-a-program-from-python-os-system-fails-due-to-spaces-in-path), then yes. ;) This is a sane way to do it. – Li-aung Yip Mar 27 '12 at 04:31
  • If you're going to do more with your script, then yes, that makes good sense. Li-Aung's advice to use `subprocess.call()` is well worth heeding. :) – sarnold Mar 27 '12 at 22:25
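
For reference, a minimal untested sketch of the subprocess.call() approach suggested above, with the pipeline assembled as a shell string from files chosen in Python (results.txt is just a hypothetical output path):

import subprocess

# files picked by whatever Python logic you like (day of week, availability, ...)
files = ["file_1.txt.gz", "file_2.txt.gz", "file_3.txt.gz"]

# build the same shell pipeline as above; results.txt is a placeholder name
cmd = "gzip -cd %s | dataclean.pl | dataprocess.pl > results.txt" % " ".join(files)

# shell=True because the shell is doing the piping and redirection
status = subprocess.call(cmd, shell=True)
if status != 0:
    print("pipeline failed with exit status %d" % status)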