
So here's the problem. I have a sample.gz file which is roughly 60KB in size. I want to decompress the first 2000 bytes of this file. I am running into a "CRC check failed" error, I guess because the gzip CRC field appears at the end of the file, so decompression requires the entire gzipped file. Is there a way to get around this? I don't care about the CRC check; even if decompression fails because of a bad CRC, that is OK. Is there a way to unzip partial .gz files?

The code I have so far is

import gzip
import StringIO

# Read only the first 2000 bytes of the compressed file into a buffer
infile = open('sample.gz', 'rb')
mybuf = StringIO.StringIO(infile.read(2000))

f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
print data

The error encountered is

File "gunzip.py", line 27, in ?
    data = f.read()
File "/usr/local/lib/python2.4/gzip.py", line 218, in read
  self._read(readsize)
File "/usr/local/lib/python2.4/gzip.py", line 273, in _read
  self._read_eof()
File "/usr/local/lib/python2.4/gzip.py", line 309, in _read_eof
  raise IOError, "CRC check failed"
IOError: CRC check failed

Also, is there any way to use the zlib module to do this and ignore the gzip headers?

user210126

4 Answers


The issue with the gzip module is not that it can't decompress the partial file; the error occurs only at the end, when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file, so the verification will never, ever work with a partial file.)

The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far; simple monkey patching will do. I wrote this context manager to temporarily replace gzip.GzipFile._read_eof while I decompress the partial file:

import contextlib
import gzip

@contextlib.contextmanager
def patch_gzip_for_partial():
    """
    Context manager that replaces gzip.GzipFile._read_eof with a no-op.

    This is useful when decompressing partial files, something that won't
    work if GzipFile does its checksum comparison.

    """
    _read_eof = gzip.GzipFile._read_eof
    gzip.GzipFile._read_eof = lambda *args, **kwargs: None
    try:
        yield
    finally:
        gzip.GzipFile._read_eof = _read_eof

An example usage:

from cStringIO import StringIO

# 'compressed' holds the truncated gzip data, e.g. the first 2000 bytes
compressed = open('sample.gz', 'rb').read(2000)

with patch_gzip_for_partial():
    decompressed = gzip.GzipFile(fileobj=StringIO(compressed)).read()
jiffyclub

It seems that you need to look into the Python zlib library instead.

The GZIP format relies on zlib, but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want/need at the moment.

See for example these code snippets from Doug Hellmann.

Edit: the code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you'll need to decode the envelope before getting to the zlib-compressed data per se. Here's more info on how to go about it; it's really not that complicated:

  • See RFC 1952 for details about the GZIP format
  • This format starts with a 10-byte header, followed by optional, non-compressed elements such as the file name or a comment, followed by the deflate-compressed data, itself followed by a trailer with a CRC-32 of the uncompressed data and its length (note: it is a CRC-32, not the Adler-32 checksum that the bare zlib format uses).
  • By using Python's struct module, parsing the header should be relatively simple
  • The compressed sequence (or its first few thousand bytes, since that is what you want to do) can then be decompressed with Python's zlib module, as shown in the examples above; note that the payload is a raw deflate stream, so zlib must be told not to expect a zlib header (passing a negative wbits value does this)
  • Possible problems to handle: there may be more than one file in the GZip archive, and the second file may start within the block of a few thousand bytes we wish to decompress.

Sorry to provide neither a simple procedure nor a ready-to-go snippet; however, decoding the file with the indications above should be relatively quick and simple. A rough sketch of the idea follows.
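A minimal sketch of that approach, assuming a single-member gzip file whose header fits inside the bytes read (decompress_partial_gzip is a hypothetical helper name, not part of any library):

import struct
import zlib

def decompress_partial_gzip(data):
    """Decompress the deflate payload of a (possibly truncated) gzip file.

    A sketch for a single-member stream whose header fits inside 'data'.
    """
    # Fixed 10-byte header: magic (1f 8b), method (8 = deflate), flag byte
    magic, method, flags = struct.unpack('<HBB', data[:4])
    if magic != 0x8b1f or method != 8:
        raise ValueError('not a gzip/deflate stream')
    pos = 10
    if flags & 4:                        # FEXTRA: 2-byte length + payload
        xlen, = struct.unpack('<H', data[pos:pos + 2])
        pos += 2 + xlen
    if flags & 8:                        # FNAME: NUL-terminated file name
        pos = data.index('\x00', pos) + 1
    if flags & 16:                       # FCOMMENT: NUL-terminated comment
        pos = data.index('\x00', pos) + 1
    if flags & 2:                        # FHCRC: 2-byte header CRC
        pos += 2
    # Negative wbits = raw deflate stream: no zlib header, no checksum
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    return d.decompress(data[pos:])

print decompress_partial_gzip(open('sample.gz', 'rb').read(2000))

Since no flush() is called and a raw deflate stream carries no trailer of its own, truncated input simply yields however much output those bytes decode to, with no CRC error.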

mjv
  • @mjv... Which particular code snippet applies to the example above? I went through the link and read Working with Streams. Nowhere does it state that it's working with gzip streams. I assume this works with zlib streams (have tested with zlib streams) – user210126 Nov 14 '09 at 00:35
  • @unknown: Check my edit; the code snippets pertain to compressing/decompressing to/from pure zlib. The GZip format implies first parsing a small, uncompressed header before finding its zlib "payload", which can be decompressed as shown. – mjv Nov 14 '09 at 05:35
  • The Doug Hellmann snippets appear to have moved [here](http://pymotw.com/3/zlib/index.html). – Ben Mar 07 '23 at 17:21

I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.

Surely you want to uncompress the file, and stop when you have uncompressed as much of the file as you need, something like:

import gzip

f = gzip.GzipFile(fileobj=open('postcode-code.tar.gz', 'rb'))
data = f.read(4000)  # decompress only the first 4000 bytes of output
print data

AFAIK, this won't cause the whole file to be read. It will only read as much compressed input as is necessary to produce the first 4000 decompressed bytes.
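For the string search described in the comment thread below, that approach might look like this (a sketch; "xyz" and the 4k window come from the comments, and the question's sample.gz is assumed):

import gzip

# GzipFile decompresses lazily, so only enough compressed input is
# consumed to produce the first 4096 bytes of output
f = gzip.GzipFile(fileobj=open('sample.gz', 'rb'))
head = f.read(4096)
print 'xyz' in head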

rjmunro
  • f.read(2000) here will read the first 2000 bytes of decompressed data. I am interested in the first 2000 bytes of compressed data. – user210126 Nov 14 '09 at 00:25
  • Why? What on earth is your application? – rjmunro Nov 14 '09 at 00:27
  • :-) I am trying to find string "xyz" in the first 4k of data. Assuming I decompress 2K of gzipped data and land with 4K of decompressed data, I can search/grep in this 4k for the string. All the searching code is already in place.. – user210126 Nov 14 '09 at 00:31
  • Assume that all I am going to get is the first 2k of compressed data from a 60K .gz file. After that nothing. Nada. I need to *find* my string in the decompressed part of this 2k – user210126 Nov 14 '09 at 00:37
  • If you want to search the first 4k of uncompressed data, search the first 4k of uncompressed data, as I do in my answer (maybe change 4000 to 4096). Don't try to guess that 2k will unzip to 4k. It may not. It may only unzip to just 2k, or it might unzip to a couple of megabytes. – rjmunro May 28 '12 at 16:24
  • This is perfect. Thank you so much! No need for dirty hacks. – Marco Roy Nov 08 '17 at 00:45

I also encountered this problem when I used my Python script to read compressed files generated by the gzip tool under Linux, and the original files were lost.

By reading the implementation of gzip.py in Python, I found that gzip.GzipFile has file-like methods and uses Python's zlib module to compress and decompress the data. At the same time, the _read_eof() method is there to check the CRC of each file.

But in some situations, like processing a stream or a .gz file without a correct CRC (my problem), an IOError("CRC check failed") will be raised by _read_eof(). Therefore, I modified the gzip module to disable the CRC check, and the problem disappeared.

def _read_eof(self):
    pass  # skip the CRC and size verification at the end of the stream

https://github.com/caesar0301/PcapEx/blob/master/live-scripts/gzip_mod.py

I know it's a brute-force solution, but it saves much time compared to rewriting some low-level methods yourself using the zlib module, such as reading the data chunk by chunk from the zipped file and extracting it line by line, most of which is already present in the gzip module.
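For what it's worth, the same effect is possible at runtime without editing gzip.py, by overwriting the method on the class (the same monkey-patching idea as the context manager in the first answer, here shown as a sketch against the question's sample.gz):

import gzip

# Globally disable the end-of-stream CRC/length verification
gzip.GzipFile._read_eof = lambda self: None

data = gzip.GzipFile(fileobj=open('sample.gz', 'rb')).read()
print data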

Jamin

caesar0301