Prevent GZipStream/DeflateStream from trying to consume more than the compressed data

Question

I have a file that could have been created something like this:

stream.Write(headerBytes, 0, headerBytes.Length);

using (var gz = new GZipStream(stream, CompressionMode.Compress, leaveOpen: true))
{
    gz.Write(otherBytes, 0, otherBytes.Length);
}

stream.Write(moreBytes, 0, moreBytes.Length);

Now when reading the file like

stream.Read(headerBytes, 0, headerBytes.Length);
// in reality I make sure that headerBytes.Length bytes actually get read,
// something the above line omits

using (var gz = new GZipStream(stream, CompressionMode.Decompress, leaveOpen: true))
{
    while ((bytesRead = gz.Read(buffer, 0, buffer.Length)) != 0)
    {
        // use buffer...
    }
}

while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) != 0)
{
    // use buffer...
}

It turns out that the GZipStream (same is true for DeflateStream) reads in 16384 bytes from stream, instead of the actual 13293 compressed bytes in the case I checked.
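The over-read described above can be reproduced with a small in-memory sketch. Everything here is illustrative: the class name `OverReadDemo`, the sizes, and the all-zero tail are assumptions, not taken from the real file. How far the underlying stream ends up past the compressed data depends on the runtime's internal buffer size, so no specific number is claimed.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class OverReadDemo
{
    // Returns the position where the gzip data really ends and the position
    // the underlying stream was left at after decompression finished.
    public static (long CompressedEnd, long PositionAfter) Run()
    {
        var header = new byte[] { 1, 2, 3, 4 };   // hypothetical header
        var payload = new byte[1000];             // data to be compressed
        var tail = new byte[5000];                // data following the gzip part
        new Random(42).NextBytes(payload);

        var stream = new MemoryStream();
        stream.Write(header, 0, header.Length);
        using (var gz = new GZipStream(stream, CompressionMode.Compress, leaveOpen: true))
        {
            gz.Write(payload, 0, payload.Length);
        }
        long compressedEnd = stream.Position;     // the gzip member ends here
        stream.Write(tail, 0, tail.Length);

        stream.Position = header.Length;          // skip the header again
        var buffer = new byte[4096];
        using (var gz = new GZipStream(stream, CompressionMode.Decompress, leaveOpen: true))
        {
            while (gz.Read(buffer, 0, buffer.Length) != 0)
            {
                // decompressed bytes would be used here
            }
        }
        return (compressedEnd, stream.Position);
    }
}
```

Comparing the two returned values shows how many bytes of the trailing data were pulled into the decompressor's internal buffer.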

Assuming I know neither the size of the compressed part of the file beforehand nor the number of bytes following the compressed data, is there a way to use GZipStream/DeflateStream

- so it only reads the compressed data from stream
- or at least figure out what the size of the compressed data part was, so I can stream.Position -= actuallyRead - compressedSize manually?

score 1 · Answer 1 · edited May 23 '17 at 12:06

That interface does not appear to provide a means to do what you want, which is one of many reasons to not use .NET's GZipStream or DeflateStream.

You should use DotNetZip instead.

answered Mar 11 '15 at 18:57

Mark Adler

score 0 · Answer 2 · edited Oct 07 '21 at 06:26

This answer amounts to an ugly workaround. I don't particularly like it, but it does work (except when it doesn't) even if only for GZipStream.

Quoting the question: "or at least figure out what the size of the compressed data part was, so I can stream.Position -= actuallyRead - compressedSize manually?"

As every gzip file (and in fact every gzip member) ends with

     +---+---+---+---+---+---+---+---+
     |     CRC32     |     ISIZE     |
     +---+---+---+---+---+---+---+---+

     CRC32
        This contains a Cyclic Redundancy Check value of the
        uncompressed data

     ISIZE
        This contains the size of the original (uncompressed) input
        data modulo 2^32.

I could just use the uncompressed size (modulo 2^32), which I know after closing the GZipStream, and seek backwards in the stream until I find the 4 bytes matching it (ISIZE is stored least-significant byte first).

To make it more robust, I should also calculate the CRC32 while uncompressing, and seek backwards in the stream to right after the 8 bytes forming the correct CRC32 and ISIZE.

Ugly, but I did warn you.

<sarcasm>How I love encapsulation. Encapsulating all the useful stuff away, leaving us with a decompressing Stream that works in exactly the one use case the all-knowing API designer foresaw.</sarcasm>

Here's a quick SeekBack implementation that works so far:

/// <returns>the number of bytes sought back (including bytes.Length)
///          or 0 in case of failure</returns>
static int SeekBack(Stream s, byte[] bytes, int maxSeekBack)
{
    if (maxSeekBack != -1 && maxSeekBack < bytes.Length)
        throw new ArgumentException("maxSeekBack must be >= bytes.Length");

    int soughtBack = 0;
    for (int i = bytes.Length - 1; i >= 0; i--)
    {
        while ((maxSeekBack == -1 || soughtBack < maxSeekBack)
               && s.Position > i)
        {
            s.Position -= 1;
            // as we are seeking back, the following will never become
            // -1 (EOS), so coercing to byte is OK
            byte b = (byte)s.ReadByte();
            s.Position -= 1;
            soughtBack++;
            if (b == bytes[i])
            {
                if (i == 0)
                    return soughtBack;
                break;
            }
            else
            {
                var bytesIn = (bytes.Length - 1) - i;
                if (bytesIn > 0) // back to square one
                {
                    soughtBack -= bytesIn;
                    s.Position += bytesIn;
                    i = bytes.Length - 1;
                }
            }
        }
    }
    // no luck? return to original position
    s.Position += soughtBack;
    return 0;
}
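The 8 bytes to search for have to be built from the CRC32 and the uncompressed length first. The following is a minimal sketch of that step, assuming a hand-rolled CRC-32. The names `GzipTrailer`, `UpdateCrc` and `BuildTrailer` are hypothetical and not part of .NET's GZipStream API. Both trailer fields are written least-significant byte first, as the gzip format specifies.

```csharp
using System;

static class GzipTrailer
{
    // Lookup table for the standard CRC-32 used by gzip
    // (reflected polynomial 0xEDB88320).
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[n] = c;
        }
        return table;
    }

    // Feed every decompressed chunk through this, starting with crc = 0
    // and passing the previous result back in for each further chunk.
    public static uint UpdateCrc(uint crc, byte[] buffer, int offset, int count)
    {
        crc ^= 0xFFFFFFFFu;
        for (int i = offset; i < offset + count; i++)
            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // Builds CRC32 followed by ISIZE, each as 4 little-endian bytes.
    public static byte[] BuildTrailer(uint crc, long uncompressedLength)
    {
        uint isize = unchecked((uint)uncompressedLength); // length modulo 2^32
        var trailer = new byte[8];
        for (int i = 0; i < 4; i++)
        {
            trailer[i] = unchecked((byte)(crc >> (8 * i)));
            trailer[4 + i] = unchecked((byte)(isize >> (8 * i)));
        }
        return trailer;
    }
}
```

Under these assumptions, the array returned by `BuildTrailer` would be passed as the `bytes` argument of `SeekBack` once the GZipStream has been closed. On success `SeekBack` leaves the stream at the first trailer byte, so advancing the position by 8 lands on the first byte after the gzip data.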

score 0 · Answer 3 · answered Mar 13 '15 at 02:05

Following Mark Adler's suggestion, I tried DotNetZip, and lo and behold, its GZipStream.Position property not only does not throw, it even returns the number of gzip bytes actually read in (short by 8, for some reason that I still have to figure out; possibly the 8-byte CRC32/ISIZE trailer is not counted).

So it does read more than strictly necessary, but it lets me calculate how much to backtrack.

The following works for me:

var posBefore = fileStream.Position;
long compressedBytesRead;
// GZipStream here is DotNetZip's Ionic.Zlib.GZipStream, not System.IO.Compression.GZipStream
using (var gz = new GZipStream(fileStream, CompressionMode.Decompress, true))
{
    while (gz.Read(buffer, 0, buffer.Length) != 0)
        ; // use it!
    compressedBytesRead = gz.Position;
}
var gzipStreamAdvance = fileStream.Position - posBefore;
var seekBack = gzipStreamAdvance - compressedBytesRead - 8; // but why "- 8"?
fileStream.Position -= seekBack;
