2

I want to save Huffman codes into a file. How can I do this? I am saving Huffman codes into a string but size of generated file is bigger than Original file.

Maynza
  • 748
  • 5
  • 18
zahratTZ
  • 21
  • 1
  • 4
  • 1
    Huffman coding results in variable bit-length codes. If you don't pack the codes together on output, so a 5-bit code only takes five bits, you won't get compression. You need to create a byte buffer that you can fill n-bits at a time, where n is not a multiple of 8. – antlersoft Jul 07 '11 at 19:23
  • I dont't understand the "You need to create a byte buffer that you can fill n-bits at a time, where n is not a multiple of 8 " ????? How can do it? – zahratTZ Jul 07 '11 at 19:32
  • You will have to set the bits manually, either by encoding your data into another datatype like an int or using another library for a bitset. – Maynza Jul 07 '11 at 19:37
  • static huffman algorithm, I suppose. for the dynamic huffman algorithm, you don't need to store anything :) – ShinTakezou Jul 07 '11 at 21:07
  • @antlersoft What if the encoded data does not result into an overall bit-count that isn't multiple of 8? One can't simply append 0-Bits to the end of the file to complete the last byte... the dummy-zeros would be decoded as well when reading the compressed file... – Silicomancer Oct 02 '21 at 23:34

3 Answers3

5

A very simple approach is to write one bit at a time with something like the following:

unsigned char acc; // Accumulator of bit waiting to be written
int bitcount;      // How many bits are aready present in the accumulator

// write a single bit (0/1)
void writebit(int bit)
{
    acc |= (bit << bitcount);
    if (++bitcount == 8)
    {
        writebyte(acc);
        acc = 0;
        bitcount = 0;
    }
}

to read back a sigle bit the procedure is symmetrical

unsigned char acc;   // bits waiting to be extracted
int bitcount;        // how many bits are still available in acc

int readbit()
{
   if (bitcount == 0)
   {
       bitcount = 8;
       acc = readbyte();
   }
   --bitcount;
   return (acc >> (7 - bitcount)) & 1;
}

of course this is just the simplest approach, but I'd wait before worrying about code speed until you are first able to save and load correctly your encoded data.

Example:

Suppose you have the following Huffman coded symbols

A - 0
B - 10
C - 110
D - 111

and that you want to encode the sequence

A B A A C D A D B B

then you would call in order

writebit(0);                           // A
writebit(1); writebit(0);              // B
writebit(0);                           // A
writebit(0);                           // A
writebit(1); writebit(1); writebit(0); // C
writebit(1); writebit(1); writebit(1); // D
writebit(0);                           // A
writebit(1); writebit(0);              // B
writebit(1); writebit(0);              // B

The actual bytes written would therefore be

(01100010) = 0x62
(01010111) = 0x57

(Note that the code shown starts from the least significant bit, i.e. you should read the bit sequences inside the parenthesis from right to left if you want to recognize the symbols).

6502
  • 112,025
  • 15
  • 165
  • 265
  • I want to save Huffman codes into a file. How can I do this? I am saving Huffman codes into a string but size of generated file is bigger than Original file. If file compressed correctly , How can read for extract? i mean if char 'a' has "11" Huffman code and b has "011" Huffman code(both of two has same ASCII code) How can we Understand which One is true 'a' or 'b'? – zahratTZ Jul 07 '11 at 21:35
  • 1
    I've added an example of what bytes would be output for a small sequence of Huffman encoded symbols. When reading a bit-coded stream you don't think in terms of bytes, but of bits... a single symbol can happen to be part in one byte and part in another (this however doesn't happen in my example). – 6502 Jul 07 '11 at 22:57
1

I believe what you are saving is a string of 1's and 0's. A true huffman code needs to be saved in binary and then parsed later on. If you are merely saving the output as a string you are defeating the purpose of a huffman code, each 0 and 1 is 8bits instead of 1.

Maynza
  • 748
  • 5
  • 18
  • I see no reason to have downvoted this. This is a very common problem among students and those unfamiliar with binary storage. I personally +1 this answer. – Zéychin Jul 07 '11 at 19:31
  • @Zéychin It is a very common problem among students, like myself a few years ago. ;) – Maynza Jul 07 '11 at 19:35
  • Exactly. Not everyone is gifted with computer-like thought-processes. (It's a good thing, too, because I would hate to SEGFAULT. :P.) – Zéychin Jul 07 '11 at 19:39
  • worth saying, though nothing made me think the OP did store the bits in that way, "as" string — he said he stored them "into" a string, which makes me think string is just a synonym for "buffer"; but in case I am wrong, +1 to have imagined a not-so-improbable mistake – ShinTakezou Jul 07 '11 at 21:13
1

You are probably saving the entire byte for each pattern/letter.

Let's say e is the most common letter. It will have a bit pattern of 0.

Let's say z is the least common letter it will have some pattern starting with 1. Let's just assign it to be 1111 111.

The file you want to be writing will be this:

0111 1111

You are PROBABLY writing this:

0000 0000 0111 1111.

You need to take advantage of bitwise operations to do this.

Zéychin
  • 4,135
  • 2
  • 28
  • 27
  • To be fair, that's the trivial part. It's a bit harder to store the run *lengths* in a compact manner along with the actual runs. – Blindy Jul 07 '11 at 19:25
  • Well, to be fair, it answers the question appropriately. This way of storing the data results in a file shorter than the original file. The reference you gave is also great, but perhaps a bit overwhelming for someone who may not be understanding the coding tree in the first place. – Zéychin Jul 07 '11 at 19:37
  • 1
    @BLindly: That's not needed. In a properly encoded Huffman encoded bit sequence you know when a symbol is finished because you reach a leaf in the Huffman tree. – 6502 Jul 07 '11 at 23:00