0

Here is my code:

public static String compress(final String input)

{

HashMap<String, Integer> codes = new HashMap<String, Integer>();
for (int i = 0; i < 256; i++)
{
  codes.put((char) i + "", i);
}

StringBuilder outputString = new StringBuilder();

int max_code = 32767;
int next_code = 257;
String currentString = new String();
char c;

for (int i = 0; i < input.length(); i++)
{
  c = input.charAt(i);
  currentString = currentString + c;
  if (!codes.containsKey(currentString))
  {
    if (next_code <= max_code)
    {
      codes.put(currentString, next_code++);
    }
    currentString = currentString.substring(0, currentString.length() - 1);
    outputString.append(codes.get(currentString));
    currentString = c + "";
  }
}
outputString.append(codes.get(currentString));

return outputString.toString();

}

I have worked from the article: http://marknelson.us/2011/11/08/lzw-revisited/

I have read some articles stating that this method is naive and very slow: https://code.google.com/p/algorithms-and-datastructures-course/source/browse/trunk/AD_exercise_4/src/ad_exercise_4/controller/LZWNaive.java?r=38

How can I speed up the algorithm. At the moment it is taking 21 seconds to compress 3MB. Could somebody provide the pseudocode that I should be following to achieve quicker results. For example 1-2 seconds to compress 3MB.

I think the !HashMap.containsKey() is the line which costs an extreme amount of time. 16 out of the 21 seconds.

Regards.

Danny Rancher
  • 1,923
  • 3
  • 24
  • 43
  • Theoretically, your approach is right. Practically, it never works. I used to follow this route and my decoder was over 10 times slower. Given this rough estimate, it should only take about at most 2s in your case. Take a look at this http://stackoverflow.com/questions/10450395/lzw-decompression-algorithm/10452456#10452456 answer and especially the links provided there – dragon66 Oct 17 '14 at 21:19

2 Answers2

0

One thing of note. The String class is immutable in Java. In other words, appending to a String with the + operator actually creates a new String. A lot of String assignment operations will cause the dereferenced String objects to build up and trigger garbage collection and that's going to slow you down big time.

At the very least, I would recommend switching to StringBuffer. Without a lot of logic changes you should gain performance right away. But StringBuffer still is not the most memory efficient way to deal with binary data as it is tuned to handle information in differing character sets. For compression/decompression you don't care about the character sets, just the bits.

ByteBuffer in the java.nio package (Java 6) would be a giant leap forward.

David H. Bennett
  • 1,822
  • 13
  • 16
  • In the post I state that the actual assignment is not the time-intensive part. It is the !HashMap.containsKey() line. Since the String is an immutable class and the hashmap is cached, I believe I have already optimized this line. Creating an encapsulated ByteBuffer class would not speed up the solution in my opinion. – Danny Rancher Feb 01 '14 at 03:55
  • If you are running a small isolated test through a profiler you might not see the GC impact. I stand by my assertion that you should not be using immutable Strings for this operation. A String in Java is totally different than a string in C. I'm not sure why you would be be seeing such a heavy load on containsKey and not on the get method later. There must be a lot of postive hits on the evaluation. Have you considered using TreeMap instead of HashMap? Depending on the data set it may speed up your lookups. They use a bit more memory (about 20% more). – David H. Bennett Feb 01 '14 at 04:42
  • I did try to change to tree map however the compression output was no longer correct. I merely substituted it. Am I using it wrong? – Danny Rancher Feb 01 '14 at 17:16
  • There shouldn't be any difference as long as you are not setting a comparator. I don't think TreeMap will gain you a lot as you need discrete lookups and not iteration through the map. A couple of other things. If you use the startup cmdline -XX:+UseCompressedStrings which is available on Java 6 (removed from Java 7 I believe). This will cause Strings to be single byte characters which will reduce the size of your keys and increase hash computing performance. Also reconsider using ByteBuffer as your HashMap key size will be reduced by 1/2. – David H. Bennett Feb 01 '14 at 21:19
0

Several of the operations done on currentString are expensive, especially as the size of currentString grows.

The statement:

    currentString = currentString + c;

loops through all chars in the string and makes a copy of the full string + the new char.

The line:

    if (!codes.containsKey(currentString))

makes use of the hash code of currentString. As currentString is a new string each time, the hash code needs to be calculated by looping through the full string (kind of nullifies the usefulness of the hash if you need to calculate it each time).

And finally the line:

    currentString = currentString.substring(0, currentString.length() - 1);

also needs to loop through the full string and makes a new copy of it.

If you want to get this program to run fast, you need to eliminate the need for looping through the same data all the time. Don't create new String's each time you want to add or remove a char, instead use a buffer of some sort where you can add and remove chars at both ends. Also consider an alternative hash code scheme, so you don't need to recalculate the full hash (by looping over the full string) just because you extended currentString with a char.

Ebbe M. Pedersen
  • 7,250
  • 3
  • 27
  • 47
  • I'm seriously puzzled as to why this is down voted?! – Ebbe M. Pedersen Mar 26 '14 at 15:54
  • Its incorrect. The solution to the problem was to rewrite the algorithm after realising that each dictionary entry can be defined as a previous code and a character, not using buffers as your answer suggested. – Danny Rancher Oct 20 '14 at 09:10
  • @Danny Rancher That you solved your problem by rewriting your algorithm don't make my answer incorrect. Your implementation in your question does excessive unnecessary String manipulation that makes the implementation slow - and this I point out in my answer. – Ebbe M. Pedersen Oct 20 '14 at 11:14
  • Please do not think I dismissed your answer. I implemented it and it within my solution where it made minimal difference before I rewrote the algorithm. I thank you for your answer, but it did not provide a feasible solution. – Danny Rancher Oct 20 '14 at 11:40