I have implemented a class for Huffman coding. The class will parse an input file and build a huffman tree from it and creates a map which has each of the distinct characters appeared in the file as the key and the huffman code of the character as its value.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the huffman tree and generate the huffman code for each character, we can see that, for the character 'a', the huffman code is '101' and for the character 'r', the huffman code is '0101' etc.
My intention is to compress the file. So I cannot write a string, which is created by replacing each character, by its huffman code, directly to the file. Since, each character would be replaced by at least 3 characters (Each '1' and '0' would still be written into the file as a character, not bits). So I thought I would write it to a file as a bytes, since there is no way you can write bits to a file. But then, 'a' and 'r' are both written as '5' into the file. This would cause problem when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
String byteString = "";
byte[] byteArr = new byte[s.length()];
int size = 0;
for (int i = 0; i < s.length(); i++) {
byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
byteString = "";
}
return byteArr;
}
I tried prefixing '1' to each of the hashcodes, to fix the problem. But then, when you build a huffman tree, reading a file, some characters would have more than 8 bits. Then, the problem is new BigInteger(byteString, 2).toByteArray()
would have more than 1 element in the array.(For eg, if 'v' has the hashcode '11010001' and new BigInteger(byteString, 2).toByteArray()
returns an array of elements [0, -47].)
Can someone please suggest me a way to write to a file such that, the file would be compressed and at the same time, these problems are also taken care.