2

Ok, so I have a zip file that contains gz files (unix gzip).

Here's what I do --

def parseSTS(file):
    import zipfile, re, io, gzip
    with zipfile.ZipFile(file, 'r') as zfile:
        for name in zfile.namelist():
            if re.search(r'\.gz$', name) != None:
                zfiledata = zfile.open(name)
                print("start for file ", name)
                with gzip.open(zfiledata,'r') as gzfile:
                    print("done opening")
                    filecontent = gzfile.read()
                    print("done reading")
                    print(filecontent)  

This gives the following result --

>>> 
start for file  XXXXXX.gz
done opening
done reading

Then stays like that forever until it crashes ...

What can I do with filecontent?

Edit: this is not a duplicate, since my gzipped files are inside a zip file and I'm trying to avoid extracting that zip file to disk. The approach works for zip files inside a zip file, as per "How to read from a zip file within zip file in Python?".

zlr
  • Where does it crash? Can you give us a stack trace? – loopbackbee Nov 20 '13 at 15:41
  • 2
    also, you should use `name.endswith(".gz")` instead of `re.search(r'\.gz$', name)`. Using regex for this is shooting a mouse with an elephant gun – loopbackbee Nov 20 '13 at 15:43
  • 1
    Use 'rb' instead of 'r' while reading the file. – Ankit Jaiswal Nov 20 '13 at 15:45
  • @goncalopp: tru dat! I'll correct it! Changing to rb doesn't change anything. The Python shell stalls and goes into the "not responding" state. How can I get a stack trace? Can I break it? – zlr Nov 20 '13 at 15:51
  • I could extract everything, then open, then delete the files, but I would prefer to do it all in memory. – zlr Nov 20 '13 at 15:53

2 Answers

1

I created a zip file containing a gzip'ed PDF file I grabbed from the web.

I ran this code (with two small changes):

1) Fixed indenting of everything under the def statement (which I also corrected in your Question because I'm sure that it's right on your end or it wouldn't get to the problem you have).

2) I changed:

            zfiledata = zfile.open(name)
            print("start for file ", name)
            with gzip.open(zfiledata,'r') as gzfile:
                print("done opening")
                filecontent = gzfile.read()
                print("done reading")
                print(filecontent)  

to:

            print("start for file ", name)
            with gzip.open(name,'rb') as gzfile:
                print("done opening")
                filecontent = gzfile.read()
                print("done reading")
                print(filecontent)  

I made this change because you were passing a file object to gzip.open instead of a filename string. I have no idea how your code executes without that change, but it crashed for me (on Python 2.7) until I fixed it.

EDIT: Adding link to GZIP docs from James R's answer --

Also, see here for further documentation:

http://docs.python.org/2/library/gzip.html#examples-of-usage

END EDIT

Now, since my gzip'ed file is small, the behavior I observe is that it pauses for about 3 seconds after printing "done reading", then outputs what is in filecontent.

I would suggest adding the following debugging line after your print("done reading") call: print(len(filecontent)). If this number is very, very large, consider not printing the entire file contents in one shot.
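As a rough illustration of that check, here is a minimal sketch that reports the size of the data and prints only a bounded preview. The helper name `summarize`, the 200-byte limit, and the stand-in payload are assumptions made for the example, not part of the original code.

```python
def summarize(filecontent, limit=200):
    """Return the total size and at most `limit` leading bytes of the data."""
    return len(filecontent), filecontent[:limit]

# Stand-in for a large decompressed payload (hypothetical data).
filecontent = b"line of text\n" * 100000

size, head = summarize(filecontent)
print("decompressed size:", size)
# Print only the preview; dumping the whole payload is what can stall a slow console.
print(head)
```

If the reported size is large, writing the data to a file instead of the console is likely to be much faster.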

I would also suggest reading this for more insight into what I expect is your problem: Why is printing to stdout so slow? Can it be sped up?

EDIT 2: an alternative that does everything in memory, for when the code above fails with a "no such file" error:

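def parseSTS(afile):
    import gzip
    import io
    import zipfile
    with zipfile.ZipFile(afile, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                # Read the compressed member into memory; nothing is written to disk.
                bfn = archive.read(name)
                # Wrap the bytes in a file-like object that gzip can read from.
                bfi = io.BytesIO(bfn)
                g = gzip.GzipFile(fileobj=bfi, mode='rb')
                qqq = g.read()
                # print() as a function works on both Python 2.7 and 3.x.
                print(qqq)

parseSTS('t.zip')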
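For anyone who wants to try this without a real archive, here is a self-contained sketch for Python 3 that builds a small zip with one gzip member in memory and reads it back the same way. The helper names `make_test_zip` and `read_gz_members`, the member name `sample.gz`, and the UTF-8 encoding are assumptions for illustration only. Decoding the bytes and calling `splitlines()` gives a list that can be looped over line by line, whereas iterating over a bytes object directly yields individual byte values.

```python
import gzip
import io
import zipfile

def make_test_zip():
    """Build an in-memory zip holding one gzip member (hypothetical sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('sample.gz', gzip.compress(b'first\nsecond\n'))
    buf.seek(0)
    return buf

def read_gz_members(afile):
    """Map each .gz member name to its decompressed text lines."""
    result = {}
    with zipfile.ZipFile(afile, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                raw = archive.read(name)
                with gzip.GzipFile(fileobj=io.BytesIO(raw), mode='rb') as g:
                    # Assumes UTF-8 text; adjust the encoding for other data.
                    result[name] = g.read().decode('utf-8').splitlines()
    return result

lines_by_name = read_gz_members(make_test_zip())
for name, lines in lines_by_name.items():
    for line in lines:
        print(name, line)
```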
selllikesybok
  • If I run it from the python interpreter via PowerShell the output is about .2 seconds to complete, instead of ~3 seconds in IDLE (where I was running it the first time). So, as noted, if it's not your code, it could be your terminal. – selllikesybok Nov 20 '13 at 16:16
  • I suppose I also didn't explicitly mention the change from `r` to `rb` above, but it should be made. However, on my test with the random PDF, the outcome was not impacted (both had the same behavior). – selllikesybok Nov 20 '13 at 16:30
  • This doesn't work: my gzipped files are in a zip file, so I logically get the following error: FileNotFoundError: [Errno 2] No such file or directory: 'XXXXXX.gz'. For this to work I would need to first unzip the zip file, then extract the gz, then do some cleanup. I find it strange that there's no way to have a file extracted from a zip file stored in memory and then passed to gzip. – zlr Nov 20 '13 at 17:51
  • The underlying file-handling on my end makes the suggestion work fine, but assuming yours does not, I think the latest edit will do it all in memory. Use zipfile.read method to get bytes of gz file, make a buffered stream, pass that to GzipFile constructor in binary read mode, and we can read/print the decompressed contents of the gz file. – selllikesybok Nov 20 '13 at 22:12
  • Yes, this works! Even without using the advanced mode for gzip, i.e. `gzo = gzip.open(bfi,'rb')` works. I also want to point out that your remark about STDOUT applies fully, as I needed to redirect to a file to see it work. Lastly, I have a remaining problem: the content of the file appears with \n, like a Unix file that was opened on Windows (which it is), hence I fail to do something like `for line in qqq:`, but that's another problem! Thanks for your help! – zlr Nov 20 '13 at 23:01
0

Most likely your problem lies here:

        if name.endswith(".gz"):  # as goncalopp said in the comments, use endswith
            # zfiledata = zfile.open(name)  # don't do this
            # print("start for file ", name)
            with gzip.open(name, 'rb') as gzfile:  # gz compressed files should be read in binary and gzip opens the files directly
                # print("done opening")  # trust in your program, luke
                filecontent = gzfile.read()
                # print("done reading")
                print(filecontent)

See here for further documentation:

http://docs.python.org/2/library/gzip.html#examples-of-usage
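Note that `gzip.open(name, 'rb')` looks for `name` as a path on disk, so it only finds a member that has already been extracted from the zip. On Python 3.3 and later, `gzip.open` also accepts an already open file object, so the member can be streamed without extracting it. A minimal sketch under that assumption follows; the function name `iter_gz_lines`, the sample archive, and the UTF-8 encoding are hypothetical.

```python
import gzip
import io
import zipfile

def iter_gz_lines(zip_source):
    """Yield (member name, text line) pairs for every .gz member, streaming."""
    with zipfile.ZipFile(zip_source, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                # archive.open() returns a binary file object for the member.
                with archive.open(name) as member:
                    # 'rt' decodes to text; UTF-8 is an assumption about the data.
                    with gzip.open(member, 'rt', encoding='utf-8') as text:
                        for line in text:
                            yield name, line.rstrip('\n')

# Build a small in-memory archive to exercise the function (sample data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('sample.gz', gzip.compress(b'alpha\nbeta\n'))
buf.seek(0)

pairs = list(iter_gz_lines(buf))
print(pairs)
```

Streaming this way avoids holding the whole decompressed payload in memory, which may matter for large members.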

James R
  • If OP is executing the code in the question, how on earth is no TypeError being raised? It should simply not run like this, unless somehow, bizarrely, the file object evaluates to a string that is, in turn, a valid filename? – selllikesybok Nov 20 '13 at 16:11
  • 1
    @selllikesybok You mean him passing in an opened file? Honestly, I don't really know. I only tested the above (in Python 2.7, mind you, and he's running 3.x). Anyway, it looks like we arrived at generally the same answer. If you add the gzip documentation to yours, I'll retract mine. – James R Nov 20 '13 at 16:19
  • True, I am running this in 2.7 also, so perhaps a bizarre issue with GZIP in 3.x? Anyway, I have added the link in my answer, with a nod to you for good measure. – selllikesybok Nov 20 '13 at 16:25
  • @selllikesybok I ran it in 3.3 just now and it threw a UnicodeDecodeError (when passing in an opened file in "r" mode, as he is doing), so I agree, he shouldn't even be getting as far as he is in the code. And thank you, I rec'd your answer. – James R Nov 20 '13 at 16:26
  • So apparently you cannot do this, and the gzip'd files from the zip need to be extracted first, then parsed with gzip. Lots of disk writes for nothing if you have a whole lot of zipped gzipped files... – zlr Nov 20 '13 at 17:57
  • 1
    @zlr did you try using StringBuffer? It's a file like object that should work in this case. Extract to it, then pass to gzip? – James R Nov 20 '13 at 18:08
  • 2
    So, as per @selllikesybok's answer, the use of `io.BytesIO` (akin to StringBuffer) does indeed make it work :) – zlr Nov 20 '13 at 23:03