There have been similar questions asked (and answered), but never really all together, and I can't seem to get anything to work. Since I am just starting with Python, something easy to understand would be great!
I have 3 large gzipped data files (>500 GB) that I need to unzip, concatenate, pipe into a subprocess, and then pipe that subprocess's output into another subprocess. I then need to process the final output, which I would like to do in Python. Note that I do not need the unzipped or concatenated data except for this processing; writing it out as an intermediate file would, I think, be a waste of space. Here is what I have so far...
import gzip
from subprocess import Popen, PIPE
# gzipped input files
zipfile1 = "./file_1.txt.gz"
zipfile2 = "./file_2.txt.gz"
zipfile3 = "./file_3.txt.gz"
# Open the first pipe
p1 = Popen(["dataclean.pl"], stdin=PIPE, stdout=PIPE)
# Unzip the files and stream their contents in (has to be a more pythonic
# way to do this - if it is even correct)
for zf in (zipfile1, zipfile2, zipfile3):
    unzipped = gzip.open(zf, 'rb')  # 'rb' to read; 'wb' would clobber the file
    p1.stdin.write(unzipped.read())
    unzipped.close()
# Close p1's stdin so it sees EOF and can finish
p1.stdin.close()
# Pipe the output of p1 to p2
p2 = Popen(["dataprocess.pl"], stdin=p1.stdout, stdout=PIPE)
# Close our copy of p1.stdout so p1 gets a SIGPIPE if p2 exits first
# (this is the step recommended in the subprocess docs)
p1.stdout.close()
## communicate() reads p2's entire output and waits for it to exit
output = p2.communicate()[0]
## more processing of output...
print output
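For what it's worth, here is the chunked version I was considering instead, since I worry that the all-at-once read() and write() above could blow out memory on files this size, or fill a pipe buffer and block before p2 ever starts reading. It's untested; the 64 KB chunk size and the feeder thread are just my guesses at how to keep both pipes from filling up:

import gzip
import threading
from subprocess import Popen, PIPE

CHUNK = 64 * 1024  # arbitrary chunk size I picked

def feed(proc, names):
    # Stream each gzipped file into proc's stdin in chunks, then send EOF
    for name in names:
        f = gzip.open(name, 'rb')
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            proc.stdin.write(data)
        f.close()
    proc.stdin.close()

p1 = Popen(["dataclean.pl"], stdin=PIPE, stdout=PIPE)
# Start p2 right away so it drains p1's output while p1 is still being fed
p2 = Popen(["dataprocess.pl"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # so p1 gets a SIGPIPE if p2 exits first

# Feed p1 from another thread so the main thread can drain p2's output;
# otherwise both pipes can fill up and everything blocks
t = threading.Thread(target=feed,
                     args=(p1, ("./file_1.txt.gz", "./file_2.txt.gz", "./file_3.txt.gz")))
t.start()
output = p2.communicate()[0]  # read all of p2's output, wait for it to exit
t.join()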
Any suggestions would be greatly appreciated. As a bonus question, the pydoc for read() says this:
"Also note that when in non-blocking mode, less data than what was requested may be returned, even if no size parameter was given."
That seems scary. Can anyone interpret it? I don't want to read in only part of a dataset thinking it is the whole thing. I thought leaving out the size argument was a good thing, especially when I don't know the size of the file.
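If a short read really can happen, would I need a loop like this to be safe? Just a sketch of what I mean; the chunk size is an arbitrary number I picked:

CHUNK = 64 * 1024

def copy_all(src, dst):
    # Read until EOF in fixed-size chunks so a short read can't silently
    # truncate the data - read() returning an empty string means EOF
    while True:
        data = src.read(CHUNK)
        if not data:
            break
        dst.write(data)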
Thanks,
GK