I'm trying to read a gzip file from a URL without saving a temporary file in Python 2.7. However, the decompressed text file comes out truncated. I have spent quite some time searching the net for a solution, without success. There is no truncation if I save the "raw" data back into a gzip file (see the sample code below). What am I doing wrong?

My example code:

    import urllib2
    import zlib
    from StringIO import StringIO

    url = "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_00-latest.vcf.gz"

    # Create an opener
    opener = urllib2.build_opener() 

    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')

    # Fetch the gzip file
    respond = opener.open(request)
    compressedData = respond.read()
    respond.close()

    opener.close()

    # Extract data and save to text file
    compressedDataBuf = StringIO(compressedData)
    d = zlib.decompressobj(16+zlib.MAX_WBITS)

    buffer = compressedDataBuf.read(1024)
    saveFile = open('/tmp/test.txt', "wb")
    while buffer:
        saveFile.write(d.decompress(buffer))
        buffer = compressedDataBuf.read(1024)
    saveFile.close()

    # Save "raw" data to new gzip file.
    saveFile = open('/tmp/test.gz', "wb")
    saveFile.write(compressedData)
    saveFile.close()
Magnus

1 Answer

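That gzip file consists of many concatenated gzip streams, as permitted by RFC 1952. The gzip utility automatically decompresses all of the streams, whereas a single zlib decompression object stops at the end of the first one, which is why your output is truncated.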

You need to detect the end of each gzip stream and restart the decompression with the subsequent compressed data. Look at unused_data in the Python zlib documentation: after a stream ends, it holds the bytes that follow it.
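
The restart loop could look roughly like the sketch below. It assumes the whole download is already in memory, as with compressedData in the question, and that nothing but gzip streams follows the first one. The helper name decompress_gzip_members is made up for illustration and is not part of zlib.

```python
import zlib


def decompress_gzip_members(data):
    """Decompress bytes made of one or more concatenated gzip streams.

    Hypothetical helper: a fresh decompression object is created for each
    stream, because one object stops at the end of the stream it started on.
    """
    chunks = []
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(d.decompress(data))
        chunks.append(d.flush())
        # Whatever follows the end of this stream is left in unused_data;
        # feed it to a new decompression object on the next pass.
        data = d.unused_data
    return b"".join(chunks)
```

With the question's variables, saveFile.write(decompress_gzip_members(compressedData)) would take the place of the chunked loop. If the file had trailing bytes that are not a gzip stream, the next decompression object would raise zlib.error, so that case would need separate handling.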

Mark Adler