0

One of my friends got this interview question. In addition, he was told he could assume the characters were letters a to z (upper or lower case). I wrote the following, but I can't figure out how to use the assumption about the limited characters (a to z) the string contains. Am I using this assumption without realizing it or can I make use of it?

  public static String compress(String str){
    int count = 1;
    char c = str.charAt(0);
    StringBuffer result = new StringBuffer();

    for (int i = 1; i < str.length();i++){
      if (str.charAt(i) == c){
        count++;
      }
      else{
        String to_add = c + String.valueOf(count);
        result.append(to_add);
        count = 1;
        c = str.charAt(i);
      }
    }
    // last character
    String to_add = c + String.valueOf(count);
    result.append(to_add);

    String result_str = result.toString();

    // Check whether the compressed string is
    // actually smaller than the original one
    if (result_str.length() < str.length()){
      return result_str;
    }
    else{
      return str;
    }
  }
Community
  • 1
  • 1
giulio
  • 659
  • 2
  • 8
  • 22

2 Answers2

0

'a' to 'Z' is 2*26=52 distinct characters, and it fits in 6-bits (2^6=64). You could just pack the code-points into sextets.

OTOH, RLE (what you have coded) works only for repetitions. If you have input like abcde it would turn into 1a1b1c1d1e or something alike, which is highly inefficient and you can hardly call it compression.

milan
  • 2,355
  • 2
  • 23
  • 38
0

Assign each character to a number, eg a = 1, z = 26. So, to represent these 26 characters you need at least 5 bits.

You can now use 2 bytes (16 bits) to store a triplet of characters. This requires 1/3 less bytes than the initial one byte per character (if ascii). To store a triplet of characters read bits from your bytes (for example left to right).

  1. First five bits of the first byte will represent the first character
  2. The next three bits of the first byte, concatenated with the first two bits of the second byte represent the second
  3. the next five bits from second byte represent the third character
  4. there is one bit left (ignore it)

*To slightly improve in compression size, if your String's length % 3 = 1, then for the last character of your String you can use one byte only as you don't have another triplet.

**You can get if a specific bit is set on a byte using the algorithm from this post, which is:

public byte getBit(byte b, int position)
{
   return (b >> position) & 1;
}

***You can set a bit to a byte using the algorithms from this post, which are:

to set a bit (set it to one)

b = b | (1 << position);

To unset a bit (set it to zero):

b = b & ~(1 << position);

****Using maths (least common multiple of 5 and 8), you could even slightly improve in compression size if you used 5 bytes = 40bits, which can represent 8 characters (8x5=40).

Then you would store octets of characters and there are no bits to ignore now. For the last characters of your String, depending on (string size % 8), you could again use less bytes.

*****Using the last 5-byte approach you get 3/8 less size, which is better than 1/3 of the 3-byte approach.

Community
  • 1
  • 1
Kostas Kryptos
  • 4,081
  • 2
  • 23
  • 24