
I have a string exactly 53 characters long that contains a limited set of possible characters.

[A-Za-z0-9\.\-~_+]{53}

I need to reduce this to length 50 without loss of information and using the same set of characters.

I think it should be possible to compress most strings down to length 50, but is it possible for all possible length-53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?

Thanks for reading.

diolemo
  • "We know that in the worst case 14 characters from the possible set will be unused." Can you elaborate on this? – John Kugelman Nov 20 '12 at 20:56
  • I believe that *encoding* is desired here, not *compression* –  Nov 20 '12 at 20:57
  • @JohnKugelman 53 character string with each character different. 67 possible characters (52 letters, 10 numbers, 5 symbols). 67-53 = 14 not used. – diolemo Nov 20 '12 at 20:57

5 Answers


If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.

  • Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 67^53 - 1 ≈ 6×10^96.
  • You want to map those numbers to an integer from 0 to 67^50 - 1 ≈ 2×10^91.
  • So by the pigeonhole principle, you're guaranteed that, on average, 67^3 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to. (A quick sketch of this counting follows below.)
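To make the counting concrete, here is a quick sketch in Python (plain arithmetic, nothing beyond what the bullets above say):

    ALPHABET = 67                 # A-Z, a-z, 0-9, and . - ~ _ +

    inputs = ALPHABET ** 53       # number of possible 53-character strings
    outputs = ALPHABET ** 50      # number of possible 50-character strings

    print(inputs > outputs)       # True: strictly more inputs than outputs
    print(inputs // outputs)      # 300763, i.e. 67**3 inputs per output on average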

To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
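For the first option, a minimal sketch of such a re-encoding: treat the input as a base-67 number and write it back out in base 87. The 20 extra output characters below are invented for the example (an assumption, not part of the question's character set):

    IN_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "abcdefghijklmnopqrstuvwxyz"
                   "0123456789.-~_+")                # the 67 allowed characters
    # Hypothetical 87-character output alphabet: the 67 inputs plus 20 extras.
    OUT_ALPHABET = IN_ALPHABET + "!#$%&'()*,/:;<=>?@[]"

    def recode(s):
        """Treat s as a base-67 number and rewrite it in base 87."""
        n = 0
        for ch in s:                                 # base-67 digits -> integer
            n = n * 67 + IN_ALPHABET.index(ch)
        out = []
        while n:
            n, d = divmod(n, 87)                     # integer -> base-87 digits
            out.append(OUT_ALPHABET[d])
        # Pad to a fixed width of 50 so decoding is unambiguous.
        return "".join(reversed(out)).rjust(50, OUT_ALPHABET[0])

    print(len(recode("x" * 53)))                     # 50

Decoding is the same conversion run in reverse; the fixed 50-character width is what keeps it unambiguous. This works because 67^53 < 87^50.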

If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).

Joe White
  • I like where this is going (+1) but this doesn't *directly* address the part of the question: *If* the 14 unused characters are identified then the input space is only `53^53`. Now, identifying and using such information is another issue but .. does it get anywhere? If not, why not? –  Nov 20 '12 at 21:52
  • No, the input space is still `67^53`. "14 input characters unused" is irrelevant. It's like saying "if I have a 3-digit number, then at least 7 possible digits are unused". Well sure, but who cares? That doesn't somehow change the `10^3` combinations to `3^3`. – Joe White Nov 21 '12 at 05:21
  • *If the unused characters are identified [known]* - say I know, perhaps because I looked over a shoulder while someone was keying in code, that only the first row (1,2,3) was used .. does this change the problem any? –  Nov 21 '12 at 05:34
  • @pst, absolutely. That's not what "14 characters from the possible set will be unused" part was about, but if you *did* know that the 3-digit number was made up of only three possible digits, then it becomes a 3-digit number in base 3, not base 10, so then it would be `3^3`. – Joe White Nov 21 '12 at 19:39

What you ask is not possible in the most general case, which can be proven very simply.

Suppose it were possible to encode an arbitrary 53-character string as 50 characters from the same set. Do that, then append three arbitrary characters to the encoded string. Now you have another arbitrary 53-character string. How do you compress that?

So what you want cannot be guaranteed to work for all possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.

In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
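As a sketch of just the last step there -- packing a bitstream back into the question's character set at 6 bits per output character -- assuming we simply pick 64 of the 67 allowed characters:

    # 64 of the 67 allowed characters; 64 distinct values = 6 bits apiece.
    SYMBOLS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789.-")

    def bits_to_chars(bits):
        """Pack a string of '0'/'1' bits into characters, 6 bits each."""
        bits += "0" * (-len(bits) % 6)               # pad to a multiple of 6
        return "".join(SYMBOLS[int(bits[i:i + 6], 2)]
                       for i in range(0, len(bits), 6))

    print(bits_to_chars("100000110000101000011"))    # 21 bits -> 4 characters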

A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.

antlersoft

One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.

Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store. (53 × 7 = 371 bits; 371 / 8 = 46.375, which rounds up to 47.)

This is not really considered compression, but rather a change in encoding.

For example "ABC" is 0x41 0x42 0x43

     0x41        0x42        0x43  // hex values
0100 0001   0100 0010   0100 0011  // binary
 100 0001    100 0010    100 0011  // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011   00001010   00011[000]
    0x83       0x0A        0x18

As an example, these 3 characters won't save any space, but your 53 characters will always come out as 47 bytes, guaranteed.

Note, however, that the output will not be in your original character set, if that is important to you.

The process becomes:

original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
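A minimal sketch of that round trip in Python; the names pack7 and unpack7 are made up for the example, and encoding assumes (as the answer notes) that every input byte has a zero high bit:

    def pack7(s):
        """Drop each byte's high bit and pack the 7-bit values together."""
        acc = nbits = 0
        out = bytearray()
        for ch in s.encode("ascii"):        # high bit is guaranteed zero here
            acc = (acc << 7) | (ch & 0x7F)  # append the 7 significant bits
            nbits += 7
            while nbits >= 8:               # flush complete bytes
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
        if nbits:                           # zero-pad the final partial byte
            out.append((acc << (8 - nbits)) & 0xFF)
        return bytes(out)

    def unpack7(data, n):
        """Recover the original n characters from the packed bytes."""
        acc = nbits = 0
        out = []
        for b in data:
            acc = (acc << 8) | b
            nbits += 8
            while nbits >= 7 and len(out) < n:
                nbits -= 7
                out.append(chr((acc >> nbits) & 0x7F))
        return "".join(out)

    packed = pack7("ABC")
    print(packed.hex())                     # 830a18 -- matches the worked example
    print(unpack7(packed, 3))               # ABC
    print(len(pack7("x" * 53)))             # 47 bytes for 53 characters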
Stephen P
  • @wared - what I mean is, if you look at the output of my encode method it will look nothing like the input - "abc" could become "•=" - because I've shifted every bit to the left. It *is* lossless, so decoding will always get the correct original string. Note it doesn't save any space until you exceed 7 characters of input. – Stephen P Feb 03 '14 at 18:07
  • @wared Yes, it's possible to get the original string back (lossless). Encoding: take 8 bits, drop the high bit, shift left, and append the next re-encoded 8 bits. Decoding: take the first 7 bits of the stream, shift right and zero-fill by 1 bit, getting back the original 8 bits. Since you're actually reading a byte stream you'll want to save the low-order bit to use as the first bit of the next sequence. Note all of this works only because the original input (question) guarantees the high bit was always zero. – Stephen P Feb 03 '14 at 20:39

If I remember correctly, Huffman coding is going to be the most compact way to store the data. It has been too long since I used it to write the algorithm out quickly, but the general idea is covered here; roughly, what you do is:

  1. get the count for each character that is used
  2. prioritize them based on how frequently they occurred
  3. build a tree based off the prioritization
  4. get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
  5. replace each character with the bits from the tree
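Here is a compact sketch of those five steps in Python, using only the standard library (a generic Huffman encoder written for this example, not code from the answer):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Steps 1-4: build a bit-string code for each character."""
        counts = Counter(text)                       # 1. count each character
        # 2./3. prioritize by frequency and repeatedly merge the two rarest
        # nodes; a node is either a character or a (left, right) pair.
        heap = [(c, i, ch) for i, (ch, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)                         # keeps heap tuples comparable
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (c1 + c2, tiebreak, (left, right)))
            tiebreak += 1
        codes = {}
        def walk(node, prefix):                      # 4. traverse the tree
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")          # left = 0
                walk(node[1], prefix + "1")          # right = 1
            else:
                codes[node] = prefix or "0"          # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    text = "this is an example"
    codes = huffman_codes(text)
    encoded = "".join(codes[ch] for ch in text)      # 5. replace each character
    print(len(encoded), "bits, versus", 8 * len(text), "bits raw")

As the comments below point out, the decoder also needs the tree (or the character counts) to reverse step 5, which is real overhead on strings this short.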
zbrunson
  • I don't think this will work in the case where each character is different. From what I understand the first occurrence of a character would need to be encoded as normal and then it is added to the internal tree structure. – diolemo Nov 20 '12 at 21:06
  • This is true, you do need a leading representation of what the encoding translates to, so it may not work on strings as short as the ones you are using. – zbrunson Nov 20 '12 at 21:08
  • Check [this](http://stackoverflow.com/questions/1138345/best-compression-algorithm-for-short-text-strings) question. – zbrunson Nov 20 '12 at 21:15

Smaz is a simple compression library suitable for compressing very short strings.

John Kugelman
Rahul Tripathi
  • Can you be sure that this will compress by 3 characters for every possible length-53 string from that character set, once the result is encoded using the character set once again? – diolemo Nov 20 '12 at 21:11
  • Looking at the examples, this probably won't work for this input, as Smaz only compresses natural language well because it uses a dictionary instead of a generic algorithm. – Christoph Nov 20 '12 at 21:30
  • Smaz also looked promising to me; I wanted to use JSmaz in a Java project. Unfortunately, only ASCII input is supported at this time. – dajood Jun 05 '13 at 15:34