7

My friend told me it existed but I could never find it; not sure if he was lying, but I'm very interested in how the proof works. (Yes, I'm one of those people who found out about Huffman coding from the Silicon Valley TV show, sorry)

  • 1
    http://www.cs.utoronto.ca/~brudno/csc373w09/huffman.pdf – Eric J. Nov 03 '15 at 23:42
  • It is probabilistic. It depends on the type of data being compressed. – Jeevan Devaranjan Nov 04 '15 at 01:57
  • 1
    It is the most efficient possible algorithm given that the input symbols are independent and identically distributed, and the encoded symbols must each be a whole number of bits. If encoded symbols don't need to be a whole number of bits you have arithmetic coding. If input symbols aren't independent and identically distributed, then neither Huffman nor arithmetic coding is optimal. – user253751 Jan 09 '16 at 12:43

3 Answers

7

The answer is: it is, it isn't, and the question is ill-posed. :-)

Here is a high-level view. A lossless compression algorithm is a reversible mapping from documents to compressed documents, and documents can be viewed as strings of bits. There are 2^n possible documents of n bits, but only 2^n - 1 bit strings that are shorter than n bits. Therefore the pigeonhole principle says that for every document that is stored more efficiently, some other possible document must be stored less efficiently.
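To make the counting concrete, here is a minimal sketch (my own numbers, not part of the original answer) showing that there are only 2^n - 1 bit strings shorter than n bits, so no reversible scheme can shorten every n-bit document:

```python
# Counting argument behind the pigeonhole observation.
# There are 2**n documents of exactly n bits, but only 2**n - 1 bit strings
# of length strictly less than n, so at least one n-bit document cannot be
# mapped to a shorter string without colliding with another document.
n = 8
documents = 2 ** n                               # all n-bit inputs: 256
shorter_outputs = sum(2 ** k for k in range(n))  # lengths 0 .. n-1: 255
print(documents, shorter_outputs)                # 256 255
print(documents > shorter_outputs)               # True
```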

So how is compression possible at all? It is possible because, while all documents are possible, they are not equally likely. A good compression algorithm stores likely documents very efficiently and unlikely ones inefficiently. But then the question becomes which documents are likely. The answer to that is, "It depends." And therefore the answer to how good a compression algorithm is also depends.

Suppose that you take random documents made out of symbols that appear independently with known probabilities. Among codes that encode each symbol on its own with a whole number of bits, Huffman coding is the most efficient possible compression algorithm.
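As an illustration, here is a minimal sketch of the standard greedy Huffman construction in Python (using `heapq`; the input frequencies come from a toy string of my own, not anything in the question):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code from a dict of symbol -> frequency (standard greedy Huffman)."""
    # Each heap entry is (total_frequency, tie_breaker, {symbol: code_suffix_so_far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        # Merging prepends one more bit on the path from the (eventual) root.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = Counter("abracadabra")   # toy example input
code = huffman_code(freqs)
print(code)
avg = sum(freqs[s] * len(code[s]) for s in freqs) / sum(freqs.values())
print(f"average bits per symbol: {avg:.3f}")
```

Running it prints a prefix code and its average cost in bits per symbol; for symbols drawn independently with these frequencies, no code that encodes each symbol with a whole number of bits does better on average.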

Now suppose you take the set of sentences that are likely to be written in English. Huffman coding only looks at raw letter frequencies; it makes no use of the fact that certain combinations of letters appear very frequently. Encodings that can exploit that structure will now work better.
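As a rough, hedged illustration (using empirical Shannon entropy as a proxy for achievable code length rather than running an actual coder, and a repeated pangram as a toy stand-in for English text):

```python
import math
from collections import Counter

def empirical_entropy(chunks):
    """Shannon entropy in bits per chunk, from empirical chunk frequencies."""
    counts = Counter(chunks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 50   # toy stand-in for English
unigrams = list(text)
bigrams = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]  # non-overlapping pairs

h1 = empirical_entropy(unigrams)       # bits per character, letters modeled independently
h2 = empirical_entropy(bigrams) / 2    # bits per character when pairs are the symbols
print(f"single letters : {h1:.3f} bits/char")
print(f"letter pairs   : {h2:.3f} bits/char")  # lower: pair statistics capture some structure
```

The per-character figure drops once pairs are treated as the symbols, which is exactly the structure a per-letter Huffman code cannot see.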

Now suppose you take the set of documents that could be produced by your camera. This looks nothing like text, and different encoding methods will work better.

So there are cases where Huffman is best, and cases where it isn't. And the question is ill-posed, since the answer depends on, "Which documents are likely?"

btilly
  • 43,296
  • 3
  • 59
  • 88
  • The optimality of Huffman coding depends not only on characteristics of the uncompressed data (the source), but also on the constraint "code each input symbol on its own, using an integral number of output symbols". Alternatives: string compression (multiple input symbols at once), and encodings like arithmetic coding. – greybeard Nov 04 '15 at 08:33
5

It is not the most efficient lossless compression method; arithmetic coding beats it, for a start. Since it is not the most efficient, there is no proof that it is. However, I believe it is the optimal code when each symbol must be encoded with an integer number of bits; perhaps that is the proof your friend was talking about.
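A small hedged illustration of that gap (toy probabilities of my own choosing): with only two symbols, Huffman has to spend a whole bit on each one, while the entropy, which arithmetic coding can approach, is far lower for a skewed source.

```python
import math

# Two-symbol source: a Huffman code must assign a 1-bit codeword to each symbol,
# so it costs exactly 1 bit per symbol no matter how skewed the source is.
p_a, p_b = 0.95, 0.05
huffman_bits = 1.0

# Shannon entropy: the per-symbol lower bound that arithmetic coding can approach.
entropy = -(p_a * math.log2(p_a) + p_b * math.log2(p_b))

print(f"Huffman : {huffman_bits:.3f} bits/symbol")
print(f"Entropy : {entropy:.3f} bits/symbol")   # about 0.286, far below 1
```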

mattnewport
  • 13,728
  • 2
  • 35
  • 39
3

Proof of Optimality of Huffman Codes, CSC373 Spring 2009: http://www.cs.utoronto.ca/~brudno/csc373w09/huffman.pdf

It proves intermediate theorems and arrives at:

Theorem 3 The algorithm HUF(A,f) computes an optimal tree for frequencies f and alphabet A.
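For reference, one standard way to phrase that optimality and the resulting per-symbol bound (my notation, not necessarily the notation used in the linked notes):

```latex
% Weighted depth (total code length) of a prefix-code tree T for frequencies f:
L(T, f) = \sum_{a \in A} f(a)\,\mathrm{depth}_T(a)

% Theorem 3 says the tree built by HUF minimizes this quantity:
L(\mathrm{HUF}(A, f), f) \le L(T, f) \quad \text{for every prefix-code tree } T \text{ over } A

% A standard consequence, with p(a) = f(a) / \sum_{b} f(b) the normalized frequencies:
H(p) \;\le\; \frac{L(\mathrm{HUF}(A, f), f)}{\sum_{b \in A} f(b)} \;<\; H(p) + 1
```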

BoppreH
  • 8,014
  • 4
  • 34
  • 71