
I'm using GZipStream to compress/decompress data. I chose it over DeflateStream because the documentation states that GZipStream also adds a CRC to detect corrupt data, which is another feature I wanted. My "positive" unit tests work well: I can compress some data, save the compressed byte array, and then successfully decompress it again. The .NET GZipStream compress and decompress problem post helped me realize that I needed to close the GZipStream before accessing the compressed or decompressed data.

Next, I continued to write a "negative" unit test to be sure corrupt data could be detected. I had previously used the example for the GZipStream class from MSDN to compress a file, open the compressed file with a text editor, change a byte to corrupt it (as if opening it with a text editor wasn't bad enough!), save it and then decompress it to be sure that I got an InvalidDataException as expected.

When I wrote the unit test, I picked an arbitrary byte to corrupt (e.g., compressedDataBytes[50] = 0x99) and got an InvalidDataException. So far so good. I was curious, so I chose another byte, but to my surprise I did not get an exception. This may be okay (e.g., I coincidentally hit an unused byte in a block of data), so long as the data could still be recovered successfully. However, I didn't get the correct data back either!

To be sure "it wasn't me", I took the cleaned-up code from the bottom of .NET GZipStream compress and decompress problem and modified it to sequentially corrupt each byte of the compressed data until it failed to decompress properly. Here are the changes (note that I'm using the Visual Studio 2010 test framework):

// successful compress / decompress example code from:
//    https://stackoverflow.com/questions/1590846/net-gzipstream-compress-and-decompress-problem
[TestMethod]
public void Test_zipping_with_memorystream_and_corrupting_compressed_data()
{
   const string sample = "This is a compression test of microsoft .net gzip compression method and decompression methods";
   var encoding = new ASCIIEncoding();
   var data = encoding.GetBytes(sample);
   string sampleOut = null;
   byte[] cmpData;

   // Compress 
   using (var cmpStream = new MemoryStream())
   {
      using (var hgs = new GZipStream(cmpStream, CompressionMode.Compress))
      {
         hgs.Write(data, 0, data.Length);
      }
      cmpData = cmpStream.ToArray();
   }

   int corruptBytesNotDetected = 0;

   // corrupt data byte by byte
   for (var byteToCorrupt = 0; byteToCorrupt < cmpData.Length; byteToCorrupt++)
   {
      // corrupt the data
      cmpData[byteToCorrupt]++;

      using (var decomStream = new MemoryStream(cmpData))
      {
         using (var hgs = new GZipStream(decomStream, CompressionMode.Decompress))
         {
            using (var reader = new StreamReader(hgs))
            {
               try
               {
                  sampleOut = reader.ReadToEnd();

                  // if we get here, the corrupt data was not detected by GZipStream
                  // ... okay so long as the correct data is extracted
                  corruptBytesNotDetected++;

                  var message = string.Format("ByteCorrupted = {0}, CorruptBytesNotDetected = {1}",
                     byteToCorrupt, corruptBytesNotDetected);

                  Assert.IsNotNull(sampleOut, message);
                  Assert.AreEqual(sample, sampleOut, message);
               }
            catch (InvalidDataException)
               {
                  // data was corrupted, so we expect to get here
               }
            }
         }
      }

      // restore the data
      cmpData[byteToCorrupt]--;
   }
}

When I run this test, I get:

Assert.AreEqual failed. Expected:<This is a compression test of microsoft .net gzip compression method and decompression methods>. Actual:<>. ByteCorrupted = 11, CorruptBytesNotDetected = 8

So, this means there were actually seven cases where corrupting the data made no difference (the string was still recovered successfully), but corrupting byte 11 neither threw an exception nor recovered the data.

Am I missing something or doing something wrong? Can anyone see why the corrupt compressed data is not being detected?

Andrew Teare
  • Does GZip have a CRC? (There might be *many* "corrupt bytes" that can't be detected *while* decompressing, as they might still result in a "valid" or "not currently yet invalid" stream.) –  Feb 26 '12 at 19:56
  • The docs say it does have a CRC. Nonetheless, some bytes can be corrupted and the data is still recovered successfully (which is okay... the 7 cases noted above). However, if it doesn't throw an exception *and* it doesn't recover the data properly, then I think this is a problem (which is where the iterations in the test stop). – Andrew Teare Feb 26 '12 at 20:10
  • Check the exceptions that are thrown: it's not always InvalidDataException –  Feb 27 '12 at 00:54
  • However, I can't explain *why* the CRC32 doesn't cause an exception to be thrown in all cases (it should be *much more* reliable than it seems) even though it does detect CRC errors in, say, a measly 60-80% of the invalid cases "over 11" that don't cause a Huffman lookup exception or other internal decoding error. I was under the impression that CRC32 should result in detection rates of over 99.999% (and then some), which does not seem to be the case here :-( –  Feb 27 '12 at 01:09
  • I'm only catching InvalidDataException, so any other would cause the test to fail because of that exception. However, the test only fails because no exception was thrown and the decompressed data is not as expected. Thanks for the idea though. I still feel like I'm missing something, but the test seems pretty straightforward. – Andrew Teare Feb 27 '12 at 01:13
  • I was wrong, so I updated/removed the old comments. The test stops because an IndexOutOfRangeException is thrown in certain cases (not an InvalidDataException) when the Huffman table lookups fail (bad Microsoft leaking details!). –  Feb 27 '12 at 01:14
  • In the meantime, I've "wrapped" the compressed data result in a simple CRC16 checksum and it catches 100% of the corrupted bytes (same test as above), as I expected. This leads me to think the "CRC" applied by GZipStream doesn't cover all the data. Even so, I wouldn't expect it to "pass" and yet still give bad data. – Andrew Teare Feb 27 '12 at 01:16
  • Per specification it should be *all* of the uncompressed data (CRC32+length) :( Why the catch rates are so low is ... surprising and not desirable. It would be interesting to see what other (better) .NET Gzip implementations do; it might be just (another) GZipStream failing. (I tested on .NET3.5 SP1) –  Feb 27 '12 at 01:17
  • That's a good idea, putting other classes to the test. I did try DeflateStream and got similar results. I may try some other solutions tomorrow. Thanks for your thoughts and ideas. – Andrew Teare Feb 27 '12 at 02:00
  • I think this problem "runs deeper". I have added a sister-question: http://stackoverflow.com/questions/9471826/why-does-gzipstream-crc32-not-reliably-detect-invalid-data –  Feb 27 '12 at 20:31
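
The workaround mentioned in the comments above, wrapping the compressed result in an outer checksum computed over the compressed bytes, can be sketched as follows. This is illustrative Python using CRC32 via `zlib` (the poster used CRC16 in C#), and the function names are invented for the example:

```python
import gzip
import struct
import zlib

def pack_with_crc(payload: bytes) -> bytes:
    """Compress payload, then prepend our own CRC32 of the compressed bytes."""
    compressed = gzip.compress(payload)
    return struct.pack("<I", zlib.crc32(compressed)) + compressed

def unpack_with_crc(blob: bytes) -> bytes:
    """Verify the outer CRC32 over the compressed bytes before decompressing."""
    (stored,) = struct.unpack("<I", blob[:4])
    compressed = blob[4:]
    if zlib.crc32(compressed) != stored:
        raise ValueError("compressed data is corrupt")
    return gzip.decompress(compressed)
```

Because the outer CRC covers every compressed byte, any single-byte corruption is rejected before decompression is even attempted, which matches the 100% detection rate reported in the comments.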

1 Answer


There is a 10-byte header in the gzip format, for which the last seven bytes can be changed without resulting in a decompression error. So the seven cases you noted with no corruption are expected.
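
The header layout is easy to inspect directly. The sketch below uses Python's `gzip` module (which wraps the same zlib) rather than .NET, purely to illustrate the byte layout:

```python
import gzip
import io

data = b"This is a compression test of microsoft .net gzip compression method and decompression methods"

# Compress with a fixed MTIME so the header bytes are deterministic
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
    gz.write(data)
compressed = buf.getvalue()

# The 10-byte gzip header (RFC 1952):
#   bytes 0-1  magic number 0x1f 0x8b  -- verified; corrupting these IS detected
#   byte  2    compression method (8 = deflate) -- verified
#   byte  3    FLG   \
#   bytes 4-7  MTIME  |  the last seven bytes: incrementing any one of
#   byte  8    XFL    |  them (as the test above does) still yields a
#   byte  9    OS    /   stream that decompresses without error
print(compressed[:10].hex())
```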

It should be vanishingly rare for a corruption anywhere else in the stream to go undetected. Most of the time the decompressor will detect an error in the format of the compressed data, never even getting to the point of checking the CRC. If it does get to the point of checking the CRC, that check should fail very nearly all the time with a corrupted input stream. ("Nearly all the time" means a probability of about 1 - 2^-32.)

I just tried it (in C with zlib) using your sample string, which produces an 84-byte gzip stream. Incrementing each of the 84 bytes in turn, leaving the rest unchanged as you did, resulted in:

  • two incorrect header checks
  • one invalid compression method
  • seven successes
  • one invalid block type
  • four invalid distances set
  • seven invalid code lengths set
  • four missing end-of-block
  • 11 invalid bit length repeat
  • three invalid bit length repeat
  • two invalid bit length repeat
  • two unexpected end of stream
  • 36 incorrect data check (that's the actual CRC error)
  • four incorrect length check (another check in the gzip format for the correct uncompressed data length)

In no case was a corrupted compressed stream left undetected.
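
The same byte-by-byte sweep can be reproduced against zlib from Python (a sketch, not the original C test). With a correct implementation, every corruption is either harmless (one of the seven mutable header bytes) or raises an error; none slips through with wrong data:

```python
import gzip
import io
import zlib

data = b"This is a compression test of microsoft .net gzip compression method and decompression methods"

# Build a gzip stream with a fixed MTIME so the header is deterministic
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
    gz.write(data)
stream = bytearray(buf.getvalue())

silent_ok = 0    # corruption was harmless: original data recovered
undetected = 0   # no error AND wrong data: should never happen
for i in range(len(stream)):
    stream[i] = (stream[i] + 1) & 0xFF   # corrupt one byte
    try:
        out = gzip.decompress(bytes(stream))
        if out == data:
            silent_ok += 1               # one of the seven mutable header bytes
        else:
            undetected += 1
    except (OSError, EOFError, zlib.error):
        pass                             # corruption detected, as expected
    stream[i] = (stream[i] - 1) & 0xFF   # restore the byte

print(silent_ok, undetected)
```

Here `silent_ok` counts the seven header bytes (FLG, MTIME, XFL, OS) whose corruption is invisible, and `undetected` stays at zero, which is the behavior the GZipStream test above fails to reproduce.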

So there must be a bug somewhere, either in your code or in the class.

Update:

It appears that there is a bug in the class.

Remarkably (or maybe not remarkably), Microsoft has concluded that they won't fix this bug!

Mark Adler
  • The .net GZipStream class produced 178 bytes of compressed data. Changing 45 of the bytes did not throw an exception and in 38 of these cases, the data was not recovered either. It's strongly pointing to a bug in the framework class, especially since you've tested the approach and another completely different implementation. Nonetheless, I'll review the code with someone else tomorrow. Thanks for taking the time to try this out. – Andrew Teare Feb 27 '12 at 06:29
  • @Mark Adler the issue is that **the CRC32 check does *not* seem to be working within expected reliability bounds** :( Try the posted example (but catch Exception, not InvalidDataException) and you'll see that only *sometimes* is the CRC determined to be invalid ... which is odd. (Sometimes the internal Huffman tables usage generates an IOBE, which is why I suggest changing it to catch Exception for testing, but that's another issue that can cause the reader to vomit *before* getting to the CRC check.) –  Feb 27 '12 at 07:38
  • I do not have a ".NET" platform, so I can't try this directly. – Mark Adler Feb 27 '12 at 14:01
  • A 94-byte input to gzip will never produce a 178-byte output. The most you could get for incompressible data would be 117 bytes out. (That is the original 94 plus five bytes to mark the stored data plus 18 bytes for the header and trailer.) I have no idea what this GZipStream stream class is producing, since it does not seem like it is producing what gzip produces. – Mark Adler Feb 27 '12 at 14:07
  • I fully understand that the poster is not getting a report back of a crc-32 mismatch or an invalid format. That is not a characteristic of the gzip format. The gzip format provides the information needed to detect the corruptions the poster is testing very nearly all of the time. Therefore there is a bug in the class or the application using it, assuming that the GZipStream class actually generates the gzip format. – Mark Adler Feb 27 '12 at 14:09
  • @Mark Adler That sanity is not true in the standard .NET BCL implementation, unfortunately -- GZipStream is notoriously bad for *not* doing this "obvious optimization" (as well as having inflated / "non ideal" dictionaries, etc.) :( After running my own tests, I would not rule out "a bad implementation"; an implementation of GZip that only *sometimes* detects the corruption via the CRC32 is ... odd. (And it does *sometimes*, say 60-80%, detect the corruption.) –  Feb 27 '12 at 19:45
  • (So, yes, I entirely agree with your response, but it does leave out the *why* and *where* bits :-) –  Feb 27 '12 at 19:50
  • I reviewed the test with someone else and we couldn't see anything fundamentally wrong. The StreamReader was a bit suspicious, so I got rid of it and used a simple byte array to avoid further string conversions, but the results were similar. I agree we should practically "never" miss corrupt data with a CRC (we use it often in many protocols). I'll have to apply my own checksum, or use another library. I'm not too concerned with the performance on small data sets since the application will be with much larger ones, but I appreciate knowing the performance relative to other implementations. – Andrew Teare Feb 28 '12 at 02:21
  • Since it pretty much seems like a bug at this point, I'll report this on Microsoft Connect too. – Andrew Teare Feb 28 '12 at 02:31
  • @AndrewTeare I know this is old but I just hit the same issue. Is http://connect.microsoft.com/VisualStudio/feedback/details/726860/gzipstream-does-not-seem-to-detect-corrupt-data your connect case? – Johannes Rudolph Mar 04 '13 at 09:19
  • Yes, that's the issue I reported. I should have put a link here too... thanks for adding it. Too bad it "won't be fixed". I ended up using the GZipStream, but calculated a checksum and saved that with the data to detect corrupt data. – Andrew Teare May 07 '13 at 02:14
  • @AndrewTeare The link is broken. MS has made a mess of its issue tracker. I would read MS's answer, if you help me find it or repost it here. Especially after Mark Adler cited this issue there: http://stackoverflow.com/questions/17212964/net-zlib-inflate-with-net-4-5 – tsul Aug 04 '16 at 16:15