4

I wanted to compress some data, so I thought I'd run the stream through deflate.

It went from 304 bytes to 578. That's 1.9x larger. I was trying to compress it... What am I doing wrong here?

using (MemoryStream ms2 = new MemoryStream())
using (var ms = new DeflateStream(ms2, CompressionMode.Compress, true))
{
    ms.WriteByte(1);
    // WriteShort is presumably a custom extension method (Stream has no WriteShort);
    // it appears to write a 16-bit length prefix before each buffer.
    ms.WriteShort((short)txtbuf.Length);
    ms.Write(txtbuf, 0, txtbuf.Length);
    ms.WriteShort((short)buf2.Length);
    ms.Write(buf2, 0, buf2.Length);
    ms.WriteShort((short)buf3.Length);
    ms.Write(buf3, 0, buf3.Length);
    // Note: on the .NET Framework of that era, DeflateStream.Flush() is effectively
    // a no-op; only disposing the DeflateStream writes the final deflate block, so
    // reading ms2 here may yield an incomplete stream.
    ms.Flush();
    result_buf = ms2.ToArray();
}
  • What happens if you put your data in a file and zip it? – Greg Hewgill Jul 14 '12 at 06:09
  • @GregHewgill zip gives me 465, gz gives me 349, original is 319. (I don't know what I changed, but the data is randomly generated on every run.) I can't tell what the .NET deflate is trying to do since I made the stream an IO stream; I'd have to write more code to check the data on this run –  Jul 14 '12 at 06:18
  • 204 bytes is less than one network packet, it is also less than one disk sector (and much less than one on the newest – 4k sector – disks). The overhead of compress of such small amounts of data is going to overwhelm any saving from size which there usually won't be (even if you were not hitting this issue). – Richard Jul 14 '12 at 16:46
  • @Richard it is a quick test. My txtbuf will be longer. However the accepted answer explains things perfectly –  Jul 14 '12 at 16:52

6 Answers

4

The degree to which your data is expanding is a bug in the DeflateStream class. The bug also exists in the GZipStream class. See my description of this problem here: Why does my C# gzip produce a larger file than Fiddler or PHP?.

Do not use the DeflateStream class provided by Microsoft. Use DotNetZip instead, which provides replacement classes.

Incompressible data will expand slightly when you try to compress it, but only by a small amount. The maximum expansion from a properly written deflate compressor is five bytes plus a small fraction of a percent. zlib's expansion of incompressible data (with the default settings for raw deflate) is 5 bytes + 0.03% of the input size. Your 304 bytes, if incompressible, should come out as 309 bytes from a raw deflate compressor like DeflateStream. A factor of 1.9 expansion on something more than five or six bytes in length is a bug.
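
For illustration only, here is a minimal sketch of that swap, assuming the DotNetZip package is referenced; its Ionic.Zlib.DeflateStream mirrors the BCL class name and constructor shape, so the change is essentially a namespace swap (the class and method names below are just for the example).

using System.IO;
using Ionic.Zlib;

class DotNetZipExample
{
    // Compresses a buffer with DotNetZip's DeflateStream instead of the BCL one.
    public static byte[] CompressWithDotNetZip(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            // Dispose the compressor before reading the buffer: disposal is what
            // flushes the final deflate block. leaveOpen keeps the MemoryStream usable.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress,
                                                   CompressionLevel.Default, true))
            {
                deflate.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }
}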

Mark Adler
3

It's possible that the data you are trying to compress is not actually compressible (or you do not have a lot of data to compress to begin with). Compression works best when there are repetitions in the data.

It's probably bigger because the compression scheme adds metadata needed to decode the stream, but since the data is not compressible, or there is not enough data for compression to take effect, the result actually ends up larger.

If you did something like zip a zip file, you would find that compression does not always make things smaller.
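
As a rough illustration (a sketch, not part of the original answer), compressing a highly repetitive buffer versus a random one with the framework's DeflateStream shows the difference; exact sizes depend on the runtime, especially given the issue described in the accepted answer.

using System;
using System.IO;
using System.IO.Compression;

class CompressibilityDemo
{
    static int DeflatedSize(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
            {
                deflate.Write(input, 0, input.Length);
            }   // disposing the DeflateStream flushes the final deflate block
            return (int)output.Length;
        }
    }

    static void Main()
    {
        var repetitive = new byte[4096];              // all zeros: lots of repetition
        var random = new byte[4096];
        new Random().NextBytes(random);               // no patterns to exploit

        Console.WriteLine(DeflatedSize(repetitive));  // far smaller than 4096
        Console.WriteLine(DeflatedSize(random));      // around 4096 or slightly more
    }
}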

Wulfram
  • This will definitely be the case. Also note that it depends on the size of the stream: smaller streams benefit much less from compression because the ratio of overhead to data size is much higher. – Aaron Murgatroyd Jul 14 '12 at 06:00
  • That is a good point. I guess I should also mention that lots of data is already compressed by nature, like images/videos/audio files, or encrypted files. When you try to compress these types of data, things will frequently get worse. – Wulfram Jul 14 '12 at 06:13
3

Small blocks of data often end up larger because the compression algorithm either adds a code table to the output or needs a larger sample to find enough repetition to work with.

You're not doing anything wrong.

Patrick Hughes
2

Shouldn't it be

using (var ms = new DeflateStream(ms2, CompressionMode.Compress, true))

instead of

using (var ms = new DeflateStream(ms, CompressionMode.Compress, true))

If you want to decorate your MemoryStream with a DeflateStream, it should be this way around.

darkey
  • Oops, you're right. It was a typo. C# gives a compile error if I have it the incorrect way –  Jul 14 '12 at 06:04
0

You answered your own question in your comment:

I don't know what I changed, but the data is randomly generated on every run

Random data is hard to compress. In general, when data has many patterns within it (like the text of a dictionary or a website), it compresses well. But the worst case for a compression algorithm is random data. Truly random data has no patterns in it; how, then, can a compression algorithm expect to compress it?

The next thing to take into account is that certain compression algorithms have overhead in how they store data. They usually have some header bits followed by symbol data. With random data, it's almost impossible to recode the data into a more compact form, and you end up with header bits interspersed among your data that serve no purpose other than to say "the following data is represented as such."

Depending on your compression format, the overhead as a percentage of the total file size can either be relatively small or large. In either case though, you will have overhead that will make your new file larger than your old one.
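
A quick way to see that fixed overhead (a sketch under the same assumptions as the earlier example, not code from this answer): compress an empty buffer, and the output is still a few bytes of framing with no payload at all.

using System;
using System.IO;
using System.IO.Compression;

class HeaderOverheadDemo
{
    static void Main()
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
            {
                // intentionally write nothing
            }
            // Typically a handful of bytes: pure deflate framing, zero payload.
            Console.WriteLine(output.Length);
        }
    }
}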

Mike Bailey
  • Untrue, well, parts of it. Only one part of what I compressed is random (the time plus a few other random-like bytes), and there is a lot of ASCII and Base64 text which should be compressible ;). It's actually the .NET implementation. It sucks. Using deflate on an HTML file gets me to 26% of the original size, while Mono gets me to 5% –  Jul 14 '12 at 16:22
  • No, he did not answer his own question. There is no reason for the data to expand _that much_. That turns out to be a bug, courtesy of Microsoft. See my answer. – Mark Adler Jul 14 '12 at 16:25
0

I don't have the reputation to leave a comment; however, the reason the compression performance is worse than you would expect is apparently not a bug per se, but a patent issue:

The reason for the compression level not being as good as with some other applications is that the most efficient compression algorithms on the market are all patent-protected. .net on the other hand uses a non-patented one.

and

Well, the explanation I got (from someone at MS), when I asked the same thing, was that it had to do with Microsoft not being able to use the GZip algorithm without modifying it; due to patent/licensing issues.

http://social.msdn.microsoft.com/Forums/fr-FR/c5f0b53c-a2d5-4407-b43b-9da8d39c01df/why-do-gzipstream-compression-ratio-so-bad?forum=netfxbcl

Initially I suspected Microsoft's gzip implementation; I knew that they implemented the Deflate algorithm, which isn't the most effective but is free of patents.

http://challenge-me.ws/post/2010/11/05/Do-Not-Take-Microsofts-Code-for-Granted.aspx

Matthew1471