1

I have implemented a class for Huffman coding. The class will parse an input file and build a huffman tree from it and creates a map which has each of the distinct characters appeared in the file as the key and the huffman code of the character as its value.

For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the huffman tree and generate the huffman code for each character, we can see that, for the character 'a', the huffman code is '101' and for the character 'r', the huffman code is '0101' etc.

My intention is to compress the file. So I cannot write a string, which is created by replacing each character, by its huffman code, directly to the file. Since, each character would be replaced by at least 3 characters (Each '1' and '0' would still be written into the file as a character, not bits). So I thought I would write it to a file as a bytes, since there is no way you can write bits to a file. But then, 'a' and 'r' are both written as '5' into the file. This would cause problem when trying to decompress the file.

This is how I am converting a series of bits to bytes:

public byte[] compressString(String s, CharCodeHashMap map) {
        String byteString = "";
        byte[] byteArr = new byte[s.length()];
        int size = 0;
        for (int i = 0; i < s.length(); i++) {
            byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
            byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
            byteString = "";
        }

        return byteArr;
    }

I tried prefixing '1' to each of the hashcodes, to fix the problem. But then, when you build a huffman tree, reading a file, some characters would have more than 8 bits. Then, the problem is new BigInteger(byteString, 2).toByteArray() would have more than 1 element in the array.(For eg, if 'v' has the hashcode '11010001' and new BigInteger(byteString, 2).toByteArray() returns an array of elements [0, -47].)

Can someone please suggest me a way to write to a file such that, the file would be compressed and at the same time, these problems are also taken care.

UnderWood
  • 803
  • 3
  • 12
  • 23

2 Answers2

0

The problem is that files in modern operating systems are modeled as indexable sequences of bytes1.

So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.

There are a variety of solutions:

  • Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 232 bits.
  • Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
  • Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.

Is there a way to deal with this without using some bits? AFAIK, No.


1 - And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless ... if that was what was concerning you.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
0

Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.

What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift left and or (or plus) operators to put more bits in the bit buffer, and the and and shift right operators to remove them. Whenever you have eight or more bits in the bit buffer, you write those eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.

So, to add the bits bits in value to the buffer:

bitBuffer |= value << bitCount;
bitcount += bits;

to write and remove available bytes:

while (bitCount >= 8) {
    writeByte(bitBuffer & 0xff);
    bitBuffer >>>= 8;
    bitCount -= 8;
}

You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.

The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158