Specifically, what programs are out there, and which has the highest compression ratio? I tried Googling it, but it seems experience would trump search results, so I'm asking.
-
Truly random data can't be compressed. ;-) The more helpful answer is longer: what are the properties of the data being compressed? (sound, image, video, binary executable, etc.) Can you tolerate loss of information? – Throwback1986 Jan 17 '11 at 17:47
-
As an example, LZW-style compression (as used in GIF) doesn't reduce the file size of photos as well as JPEG compression does. On the other hand, JPEG compression of "artificial" images, like a comic strip, will result in a noticeable loss of quality. – Throwback1986 Jan 17 '11 at 17:51
-
Random binary data; it's pretty clear what the format is. – ox_n Jan 17 '11 at 18:47
-
Then it can't be compressed *losslessly*. In fact, you will find your file size will increase. – Throwback1986 Jan 17 '11 at 21:11
-
Apparently random data is impossible to compress. Imagine that. Impossible. So I shouldn't be able to do it? That's disappointing. – ox_n Jan 18 '11 at 08:04
-
"Truly random data can't be compressed." Haha. False. Truly random data *that is perfectly spread* cannot be compressed. Compression relies on redundancy, and redundancy is very possible in random data. Granted, you likely won't get much compression out of random data due to its tendency to be much more evenly spread, but it is certainly possible. – Andrew May 30 '15 at 01:29
-
@Andrew Are you saying that: if you compress every possible n-byte file, the average size of the compressed results will be less than n bytes? I'm pretty sure you can disprove that mathematically. – Dec 08 '16 at 15:25
-
Very, very slightly, yes. True randomness != perfectly non-redundant distribution; therefore, due to expected redundancies of even a slight amount, an optimal compression algorithm applied will result in very slightly smaller file sizes. – Andrew Dec 09 '16 at 17:09
-
@Andrew: liked the subtle points in your comments. A fair coin tossed 6 times can yield `TTTHHH`, though this occurrence is not very likely. – wabbit Jun 06 '17 at 11:20
2 Answers
If file sizes could be specified accurate to the bit, for any file size N, there would be precisely 2^(N+1)-1 possible files of N bits or smaller. In order for a file of size X to be mapped to some smaller size Y, some file of size Y or smaller must be mapped to a file of size X or larger. The only way lossless compression can work is if some possible files can be identified as being more probable than others; in that scenario, the likely files will be shrunk and the unlikely ones will grow.
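As a quick sanity check of that counting argument, here is a small Python sketch (the `all_bitstrings` helper is purely illustrative) that enumerates the small cases:

```python
from itertools import product

def all_bitstrings(max_len):
    """Every bit string of length 0 through max_len, inclusive."""
    return [bits for n in range(max_len + 1) for bits in product("01", repeat=n)]

N = 10
print(len(all_bitstrings(N)))      # 2**(N+1) - 1 == 2047 files of N bits or smaller
print(len(all_bitstrings(N - 1)))  # 2**N - 1     == 1023 files strictly smaller than N bits
print(2 ** N)                      # 1024 files of exactly N bits
# 1024 inputs cannot all map to distinct outputs among only 1023 shorter strings,
# so a lossless compressor that shrinks some N-bit files must grow others.
```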
As a simple example, suppose that one wishes to store losslessly a file in which the bits are random and independent, but instead of 50% of the bits being set, only 33% are. One could compress such a file by taking each pair of bits and writing "0" if both bits were clear, "10" if the first bit was set and the second one not, "110" if the second was set and the first not, or "111" if both bits were set. The effect would be that each pair of bits would become one bit 44% of the time, two bits 22% of the time, and three bits 33% of the time. While some strings of data would grow, others would shrink; the pairs that shrank would, if the probability distribution was as expected, outnumber those that grew (4/9 of the pairs would shrink by a bit, 2/9 would stay the same, and 3/9 would grow, so pairs would on average shrink by 1/9 of a bit, and files would on average shrink by 1/18 of their size, since the 1/9 figure is per two-bit pair).
Note that if the bits actually had a 50% distribution, then only 25% of pairs would become one bit, 25% would stay two bits, and 50% would become three bits. Consequently, 25% of pairs would shrink and 50% would grow, so pairs would on average grow by a quarter of a bit, and files would grow by 12.5%. The break-even point would be about 38.2% of bits being set (two minus the golden mean), which would yield 38.2% of bit pairs shrinking and the same percentage growing.
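Here is a minimal Python sketch of that two-bits-in, one-to-three-bits-out scheme (the function names are just for illustration); it simulates files with different fractions of set bits to check the 1/18 shrink, the 12.5% growth, and the 38.2% break-even point:

```python
import random

# Sketch of the pair code described above (illustrative only):
# 00 -> "0", 10 -> "10", 01 -> "110", 11 -> "111"
CODE = {(0, 0): "0", (1, 0): "10", (0, 1): "110", (1, 1): "111"}

def encode(bits):
    """Encode an even-length list of bits, two bits at a time."""
    return "".join(CODE[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))

def average_ratio(p_set, n_bits=100_000, seed=0):
    """Output bits per input bit for a file whose bits are set with probability p_set."""
    rng = random.Random(seed)
    bits = [1 if rng.random() < p_set else 0 for _ in range(n_bits)]
    return len(encode(bits)) / n_bits

print(average_ratio(1 / 3))   # ~0.944: files shrink by about 1/18
print(average_ratio(0.5))     # ~1.125: files grow by about 12.5%
print(average_ratio(0.382))   # ~1.0:   the break-even point
```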

-
I take it that is a simple explanation of Kolmogorov complexity. Not bad. – ox_n Jan 17 '11 at 18:46
-
More detailed explanations would tend to make many readers' eyes glaze over. Although the approach of compressing two bits at a time to 1-3 output bits is simple, I think it conveys pretty well the nature of the challenge. Compressing 1-3 input bits into 2 output bits would be another approach, e.g. (000, 001, 01, 1), but computing the associated probabilities would be harder. – supercat Jan 17 '11 at 19:57
-
Excellent explanation of "why" compression works. I have always been a victim of the eye-glazing. +1 – Steven Lu Sep 15 '11 at 23:52
There is no one universally best compression algorithm. Different algorithms have been invented to handle different data.
For example, JPEG compression allows you to compress images quite a lot because it doesn't matter too much if the red in your image is 0xFF or 0xFE (usually). However, if you tried to compress a text document, changes like this would be disastrous.
Also, even between two compression algorithms designed to work with the same kind of data, your results will vary depending on your data.
Example: sometimes a gzip-compressed tarball is smaller, and sometimes a bzip2-compressed tarball is smaller.
Lastly, for truly random data of sufficient length, the compressed output will likely be almost the same size as (or even larger than) the original data.
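A rough way to see both effects is to run Python's standard-library codecs over contrasting inputs; something along these lines (gzip, bzip2, and LZMA stand in for the gzip/bzip tarball comparison, and the exact sizes will vary with the input):

```python
import bz2, gzip, lzma, os

# Illustrative inputs: highly repetitive text vs. incompressible random bytes.
samples = {
    "repetitive text": b"the quick brown fox jumps over the lazy dog\n" * 2000,
    "random bytes": os.urandom(88_000),
}

codecs = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}

for sample_name, data in samples.items():
    for codec_name, compress in codecs.items():
        out = compress(data)
        print(f"{sample_name:16s} {codec_name:6s} {len(data):7d} -> {len(out):7d} bytes")
```

On a run like this the repetitive text typically shrinks dramatically under all three codecs (which one wins depends on the data), while the random bytes typically come out slightly larger than they went in.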

-
There has to be one universally best compression algorithm. I think logic would demand that be true, unless there were multiple algorithms with equal compression ratios tied for the best. – ox_n Jan 17 '11 at 18:49
-
There are indeed many methods which could be considered "tied" for the best compression ratio for a particular type of data, as well as many methods which are specialized for a particular type of data and offer better performance on it than generic methods (audio, picture, movie, etc.). You need to determine what assumptions you can make about your data; more assumptions usually (but not always) lead to higher compression ratios for that particular type of data. – helloworld922 Jan 17 '11 at 21:08