
I've been searching the Internet but couldn't find what I need.

I have to compress big files using Huffman coding. My idea was to read only the first 1-2 MB of the file (to avoid reading the whole file once to build the tree and then once more to encode it, i.e. O(2n)), and build the Huffman tree from that. If any of the 256 possible byte values was missing, I'd add it myself, in case it appears later in the file (and not in the first 1-2 MB). But trying to test the result using this:

int * totalFr = new int[256];
unsigned char * symArr = new unsigned char[256];

for (int i = 0; i < 256; i++)
{
    totalFr[i] = i;                             // dummy test frequencies 0..255
    symArr[i] = static_cast<unsigned char>(i);  // symbol i is the byte value i
}

int size = 256;  // sizeof on a pointer would not give the element count here
buildHuffmanTree(totalFr, symArr, size);
delete[] totalFr;
delete[] symArr;

where buildHuffmanTree is the function that builds the Huffman tree, made me realise that the shortest code I could get was 7 bits, for example 0000001.

And this is where my question came from: is it worth building a Huffman tree for the full 256-symbol alphabet? Or is it better to use adaptive Huffman coding for chunks of 1-2 MB?
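
To make the idea above concrete, here is a rough, hypothetical sketch (the 2 MB sample size is an assumption, and the resulting arrays would then be passed to buildHuffmanTree): count byte frequencies over the first 2 MB only, then give any byte value that never appeared a frequency of 1 so it still gets a code.

#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    std::vector<int> totalFr(256, 0);
    std::vector<unsigned char> symArr(256);
    for (int i = 0; i < 256; i++)
        symArr[i] = static_cast<unsigned char>(i);

    const long SAMPLE = 2 * 1024 * 1024;          // sample only the first 2 MB
    int c;
    for (long n = 0; n < SAMPLE && (c = std::fgetc(f)) != EOF; n++)
        totalFr[c]++;
    std::fclose(f);

    // Any byte value missing from the sample still gets a nonzero frequency,
    // in case it shows up later in the file.
    for (int i = 0; i < 256; i++)
        if (totalFr[i] == 0) totalFr[i] = 1;

    // totalFr.data() and symArr.data() would now be handed to buildHuffmanTree(...).
    for (int i = 0; i < 256; i++)
        std::printf("%3d -> %d\n", i, totalFr[i]);
    return 0;
}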

Petar D.
  • delete[] isn't C. Either your tag is wrong or you're going to have compilation issues – UKMonkey Jan 24 '17 at 12:02
  • I'm doing it in C++. I tagged it `C` because I thought it was similar, and people would rather tell me to use `std::vector` than actually answer my question about the algorithm. I will untag it now :) – Petar D. Jan 24 '17 at 12:09
  • Depends totally on the file data. If the data in the rest of the file is "close enough" to the beginning, then your idea works. If not, then you must do a different table for each section. Try both ways on a few files. – stark Jan 24 '17 at 15:19

1 Answer


You can't expect a whole lot out of just Huffman coding unless the data is extremely biased with respect to which bytes are present. I just tried on a 100 MB file of English text from Wikipedia. It got the file down to 63% of its original size, so maybe eight bits down to five bits on average. Also that was doing the Huffman in blocks of about 16 KB at a time, so that the code was adapted to each block.
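
For illustration only (not the code behind the numbers above), a minimal sketch of that per-block approach: count byte frequencies for each 16 KB block, build a Huffman code for just that block, and add up the resulting code lengths to estimate the compressed size. The block size and the omission of the code descriptions are simplifying assumptions.

// A minimal sketch: build a separate Huffman code for each 16 KB block and
// estimate the compressed size, ignoring the cost of the code descriptions.
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Return the Huffman code length (in bits) of each symbol with freq[i] > 0.
static std::vector<int> huffmanCodeLengths(const std::vector<long long>& freq)
{
    struct Node { long long weight; int parent; };           // parent == -1 means root
    std::vector<Node> nodes;
    typedef std::pair<long long, int> WN;                    // (weight, node index)
    std::priority_queue<WN, std::vector<WN>, std::greater<WN> > heap;

    std::vector<int> leafOf(freq.size(), -1);                // symbol -> its leaf node
    for (size_t s = 0; s < freq.size(); s++) {
        if (freq[s] > 0) {
            leafOf[s] = (int)nodes.size();
            nodes.push_back({freq[s], -1});
            heap.push(WN(freq[s], leafOf[s]));
        }
    }
    // Repeatedly merge the two lightest subtrees.
    while (heap.size() > 1) {
        WN a = heap.top(); heap.pop();
        WN b = heap.top(); heap.pop();
        int p = (int)nodes.size();
        nodes.push_back({a.first + b.first, -1});
        nodes[a.second].parent = p;
        nodes[b.second].parent = p;
        heap.push(WN(a.first + b.first, p));
    }
    // A symbol's code length is the depth of its leaf in the merged tree.
    std::vector<int> len(freq.size(), 0);
    for (size_t s = 0; s < freq.size(); s++) {
        if (leafOf[s] < 0) continue;
        int d = 0;
        for (int n = leafOf[s]; nodes[n].parent != -1; n = nodes[n].parent) d++;
        len[s] = d > 0 ? d : 1;                              // lone symbol still needs 1 bit
    }
    return len;
}

int main(int argc, char** argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    const size_t BLOCK = 16384;                              // assumed ~16 KB blocks
    std::vector<unsigned char> buf(BLOCK);
    long long inBits = 0, outBits = 0;
    size_t got;
    while ((got = std::fread(buf.data(), 1, BLOCK, f)) > 0) {
        std::vector<long long> freq(256, 0);
        for (size_t i = 0; i < got; i++) freq[buf[i]]++;     // frequencies for this block only
        std::vector<int> len = huffmanCodeLengths(freq);
        for (int s = 0; s < 256; s++) outBits += freq[s] * len[s];
        inBits += (long long)got * 8;
    }
    std::fclose(f);
    if (inBits > 0)
        std::printf("Huffman-only estimate: %.1f%% of original size\n",
                    100.0 * outBits / inBits);
    return 0;
}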

Normal zlib compression, which also looks for matching strings, gets it down to 35% of the original size. More advanced compressors such as xz, which spend more time and memory looking harder and farther for matching strings, and also do a little better than Huffman coding, get it down to 26% of the original size.
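
For comparison, a small sketch using zlib's API (compress2 and compressBound; reading the whole file into memory is just a simplification) to measure that ratio on a file of your own:

// Assumes zlib is installed; compile with -lz. Compresses the whole file in
// memory with zlib and prints the resulting size ratio.
#include <cstdio>
#include <vector>
#include <zlib.h>

int main(int argc, char** argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    std::vector<unsigned char> in;
    unsigned char chunk[65536];
    size_t got;
    while ((got = std::fread(chunk, 1, sizeof(chunk), f)) > 0)
        in.insert(in.end(), chunk, chunk + got);
    std::fclose(f);

    uLongf outLen = compressBound(in.size());                // worst-case output size
    std::vector<unsigned char> out(outLen);
    if (compress2(out.data(), &outLen, in.data(), in.size(),
                  Z_BEST_COMPRESSION) != Z_OK) {
        std::fprintf(stderr, "compress2 failed\n");
        return 1;
    }
    if (!in.empty())
        std::printf("zlib: %.1f%% of original size\n", 100.0 * outLen / in.size());
    return 0;
}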

Mark Adler
  • I was worried that doing Huffman coding with the complete 256-symbol set could exceed the original data size. What do you mean by the code being adapted to each 16 KB block? That you do normal Huffman coding separately for each 16 KB chunk, or that you used `adaptive Huffman` (and if so, why would you update it per 16 KB and not per character)? Thanks in advance :) – Petar D. Jan 24 '17 at 20:52
  • Yes, adding symbols that don't appear could result in expansion instead of compression for otherwise compressible data. You should instead do what zlib does, which is make a new Huffman code for each block of data. The time it takes to compute a new code and the bits it takes to send a description of the code can be sufficiently small fractions for large enough blocks. – Mark Adler Jan 24 '17 at 20:57
  • So I need to serialize the tree in front of every chunk, right? Hence, there is no way to avoid `n` for creating the tree and one more `n` for encoding it? – Petar D. Jan 24 '17 at 21:04
  • Yes, but you don't need to send the tree. You just need to send the lengths, which can themselves be compressed, and use a canonical code. – Mark Adler Jan 24 '17 at 21:23
  • I don't understand. How can I decode it later, if I don't have the tree? Is this what you mean by `send the lengths` - http://stackoverflow.com/questions/759707/efficient-way-of-storing-huffman-tree – Petar D. Jan 24 '17 at 21:49
  • All you need is the number of bits for each symbol, from which you can construct the same Huffman code on both ends. You can read [the article on Wikipedia](https://en.wikipedia.org/wiki/Canonical_Huffman_code). – Mark Adler Jan 24 '17 at 21:55
  • Thank you again; if I have any questions I will ask you here. – Petar D. Jan 24 '17 at 22:12
  • I did the encoding canonically, but now I don't know how to do the decoding. Can you suggest something? I've been looking into a `look-up table`, but I'm not sure it's appropriate for canonical Huffman – Petar D. Jan 27 '17 at 18:56
  • You can look at [puff.c](https://github.com/madler/zlib/blob/master/contrib/puff/puff.c) for an example of a simple canonical code decoder. `decode()` takes a list of the number of symbols of each code length, and a list of corresponding symbols. – Mark Adler Jan 27 '17 at 19:13
  • I did it like you said - chunks of 16 KB each. It does pretty well with `.txt` files, but gets essentially no compression on video files. I checked and saw that in the video files almost all 256 byte values appear. Do you have any idea how I can compress such files better with this algorithm? – Petar D. Jan 28 '17 at 22:44
  • Video files are already compressed. You can't compress them any more as is. The only thing you could do is recode them with a different video format. – Mark Adler Jan 29 '17 at 01:21
  • I have one final question. It is all working OK now, but the last character I decode seems to be arbitrary. I think this is because when I encode and reach the end, if I only have `101` left to write, for example, I pad the last byte to `00000101`, and when decoding those padding zeros mess things up. I don't know how to signal something like `End Of Chunk` – Petar D. Jan 30 '17 at 23:28
  • You need to add an end-of-stream symbol with frequency 1. – Mark Adler Jan 30 '17 at 23:43
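
To make the canonical-code discussion above concrete, here is a minimal sketch (the MAXBITS limit and the example lengths are assumptions): given only the code length of each symbol, both encoder and decoder can rebuild identical codes by handing them out in order of increasing length, and by symbol value within a length, which is essentially the scheme described in RFC 1951 and used by puff.c.

// Minimal sketch: rebuild canonical Huffman codes from the code lengths alone.
// Both the encoder and the decoder run this, so only the lengths are stored.
#include <cstdio>
#include <vector>

struct CanonicalCode { int symbol; int length; unsigned int bits; };

// lengths[s] is the Huffman code length of symbol s (0 means the symbol is unused).
std::vector<CanonicalCode> buildCanonical(const std::vector<int>& lengths)
{
    const int MAXBITS = 32;                                  // assumed length limit
    // How many codes exist of each length.
    std::vector<int> count(MAXBITS + 1, 0);
    for (size_t s = 0; s < lengths.size(); s++)
        if (lengths[s] > 0) count[lengths[s]]++;

    // First code of each length: shorter codes come first, and each length
    // starts where the previous length left off, shifted left one bit.
    std::vector<unsigned int> next(MAXBITS + 1, 0);
    unsigned int code = 0;
    for (int len = 1; len <= MAXBITS; len++) {
        code = (code + count[len - 1]) << 1;
        next[len] = code;
    }

    // Hand out codes in increasing symbol order within each length.
    std::vector<CanonicalCode> table;
    for (size_t s = 0; s < lengths.size(); s++) {
        if (lengths[s] == 0) continue;
        table.push_back({(int)s, lengths[s], next[lengths[s]]++});
    }
    return table;
}

int main()
{
    // Example code lengths for five hypothetical symbols (symbol 4 never occurs).
    std::vector<int> lengths = {2, 1, 3, 3, 0};
    std::vector<CanonicalCode> codes = buildCanonical(lengths);
    for (size_t i = 0; i < codes.size(); i++) {
        std::printf("symbol %d -> ", codes[i].symbol);
        for (int b = codes[i].length - 1; b >= 0; b--)
            std::putchar((codes[i].bits >> b & 1) ? '1' : '0');
        std::putchar('\n');
    }
    return 0;
}

On the decode side, a puff.c-style decoder only needs the count of codes of each length and the symbols sorted in that same canonical order. For the padding issue in the last comments, the extra end-of-stream symbol simply gets its own entry in this table; the encoder writes it last and the decoder stops as soon as it decodes it.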