1

I am trying to implement compression of files using Huffman encoding. Currently, I am writing the header as the first line of the compressed file and then writing the encoded binary strings (i.e. strings having the binary encoded value).

However, instead of reducing the file size, my file size is increasing as for every character like 'a', I am writing its corresponding binary, for example 01010001 which takes more space.

How can I write it into the file in a way that it reduces the space?

This is my code

public void write( String aWord ) {

        counter++;
        String content;
        byte[] contentInBytes;

        //Write header before writing file contents
        if ( counter == 1 )
        {
            //content gets the header in String format from the tree
            content = myTree.myHeader;
            contentInBytes = content.getBytes();

            try {
                fileOutputStream.write(contentInBytes);
                fileOutputStream.write(System.getProperty("line.separator").getBytes());
            } catch (IOException e) {
                System.err.println(e);
            }
        }

        //content gets the encoded binary in String format from the tree
        content = myTree.writeMe(aWord);
        contentInBytes = content.getBytes();


            try {
                fileOutputStream.write(contentInBytes);
                fileOutputStream.write(System.getProperty("line.separator").getBytes());
            } catch (IOException e) {
                System.err.println(e);
            }
        }

Sample input file:

abc
aef
aeg

Compressed file:

{'g':"010",'f':"011",'c':"000",'b':"001",'e':"10",'a':"11"}
11001000
1110011
1110010
JGPhilip
  • 1,388
  • 17
  • 19
  • Is there any calling code for this? How are you populating myTree? – J E Carter II Oct 12 '15 at 17:17
  • Yes, there is a linked list that has the characters and their values, and the "content" is getting the correct binary value for that particular line. My only issue is space here, so I need to write to the file in a way that it takes lesser space that what it is doing now since my current compressed file is ending up being 4-5 times size of the original – JGPhilip Oct 12 '15 at 17:21
  • So you can test or log to verify myTree has unique members... e.g. "a" is not repeated. – J E Carter II Oct 12 '15 at 17:22
  • Yes there are no instances of duplicates in the tree. I just need the writing to be done efficiently. – JGPhilip Oct 12 '15 at 17:24
  • Once solution would be using ASCII char set 0-255 grouping the bits to byte (8).. since now when you are writing 0 you are actually writing byte 48. – Petter Friberg Oct 12 '15 at 17:27
  • How can I do that @PetterFriberg? – JGPhilip Oct 12 '15 at 17:30
  • Something like this http://stackoverflow.com/questions/18975764/how-to-convert-binary-string-to-the-byte-array-of-2-bytes-in-java – Petter Friberg Oct 12 '15 at 17:33
  • As an excercise the result looks fine to me. You can not expect any decreas in size because you are writing text (if you output binary the output would not be human-readable without extra tools). Also, since your input is very small, the header (the symbol table) wil already be bigger than the original input, even *if* you output binary data. The huffman encoding itself looks fine as is, what are you trying to achieve? – Durandal Oct 12 '15 at 18:57
  • How can I write in binary? I do not need my compressed file to be in human readable format @Durandal. And for bigger files, the size is increasing a lot, for instance for a 11MB file, I am getting the compressed file to be of about 55MB – JGPhilip Oct 12 '15 at 19:34
  • You can't write individual bits to file you need something like the answer, fine a logic for the grouping and your ok, example 1 byte tells how many digits are used and second the digts.. note also this simple way you will reduce file size to 1/4 – Petter Friberg Oct 12 '15 at 20:58

1 Answers1

6

As I gathered from the comments, you are writing text, but what you really want to achive is writing binary data. What you currently have is a nice demo for huffman encoding, but impractical for actually compressing data.

To achieve compression, you will need to output the huffman symbols as binary data, where you currently output the string "11" for an 'a', you will need to just output two bits 11.

I presume this is currently coded in myTree.writeMe(), you need to modify the method to not return a String, but something more suited to binary output, e.g. byte[].

It depends a bit on the inner workings of you tree class how to do this. I presume you are using some StringBuilder internally and simply add the encoded symbol strings while looping over the input. Instead of a StringBuilder you will need a container capable of dealing with single bits. The only suitable class that comes to min immediately is java.util.BitSet (in practice one often will write a specialized class for this, with a specialized API to do this quickly). But for simplicity lets use BitSet for now.

In method writeMe, you will in principle do the following:

 BitSet buffer = new BitSet();
 int bitIndex = 0;
 loop over input symbols {
     huff_code = getCodeForSymbol(symbol)
     foreach bit in huff_code {
         buffer.put(bitIndex++, bit)
     }
 }
 return buffer.toByteArray();

How to do this efficiently depends on how you internally defined the huffman code table. But prinicple is simple, loop over the code, determine if each place is a one or a zero and put them into the BitSet at consecutive indices.

if (digits == '1') {
    buffer.set(bitIndex);
} else {
    buffer.clear(bitIndex);
}

You now have your huffman encoded data. But the resulting data will be impossible to decompress properly, since you are currently processing words and you do not write any indication where the compressed data actually ends (you do this currently with a line feed). If you encoded for example 3 times an 'a', the BitSet would contain 11 11 11. Thats 6 bits, but when you convert to byte[] its gets padded to 8 bits: 0b11_11_11_00.

Those extra, unavoidable bits will confuse your decompression. You will need to handle this somehow, either by encoding first the number of symbols in the data, or by using an explicit symbol signaling end of data.

This should get you an idea how to continue. Many details depend on how you implement your tree class and the encoded symbols.

Durandal
  • 19,919
  • 4
  • 36
  • 70