1

Is there any way to compress a small string (86 chars) into something smaller?

@a@1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs

The only way I see is to replace recurring characters with a unique character, but I can't find anything about that on Google. Thanks for any reply.

MPelletier
SLI
  • Here's an idea: http://en.wikipedia.org/wiki/Huffman_coding – BlackBear Apr 08 '12 at 17:13
  • There's no generic way to do this. If your characters can only take on certain values, then perhaps something like base-64 encoding would help. An entropy-based system (e.g. Huffman) or a dictionary-based system (e.g. LZW) can give no guarantees about size reduction for individual strings. – Oliver Charlesworth Apr 08 '12 at 17:14
  • Is the range of the character set, in terms of ascii codes, less than 128? E.g. if you only use codes 32 to 140. Then you can represent each character with <= 7 bits and save some space using overlapping representations. I.e. the first 7 bytes will represent 8 characters. – Matt Phillips Apr 08 '12 at 17:14
  • possible duplicate of [Really simple short string compression](http://stackoverflow.com/questions/1192732/really-simple-short-string-compression) – Magnus Johansson Apr 08 '12 at 17:16
  • .. and: [Best compression algorithm for short text strings](http://stackoverflow.com/questions/1138345/best-compression-algorithm-for-short-text-strings) – Magnus Johansson Apr 08 '12 at 17:17

3 Answers

2

Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) would probably be a pretty good start. In general, the idea is to replace individual characters with the smallest bit patterns needed to reproduce the original string or dataset.

You'll want to run a statistical analysis on a variety of 'small strings' to find the most common characters, so that the more common characters are represented with the shortest unique bit patterns. You could also make up an 'example' small string containing every character that will need to be represented (like a-z0-9@.0-).
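
For illustration, here is a minimal Huffman-coding sketch in Python. The frequency table is built from the single sample string above purely as a stand-in; in practice you would gather frequencies from many representative small strings, and you would still need to pack the bit string into bytes and share the code table (or the frequency model) with the decoder.

```python
import heapq
from collections import Counter

def huffman_codes(freq):
    # Heap entries are (count, tiebreaker, tree); the tiebreaker keeps heapq
    # from ever comparing two trees directly.
    heap = [(n, i, ch) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: a single character
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

sample = r"@a@1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs"
codes = huffman_codes(Counter(sample))
bits = "".join(codes[c] for c in sample)
print(len(sample) * 8, "bits raw ->", len(bits), "bits Huffman-coded (code table not counted)")
```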

Jared Kipe
  • Huffman coding is a killer for what this guy is looking to do. As you mention, he would need to know the relative probability for the technique to be meaningful. – kasavbere Apr 08 '12 at 17:36
  • That's not exactly true; it would still offer pretty good compression just from mapping ASCII down to a smaller bit space, using only the needed, and relatively small, character set these strings draw from. It won't be 'optimal' unless there is a strong statistical relationship in the data. – Jared Kipe Apr 08 '12 at 17:47
1

I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
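
For anyone who wants to reproduce that kind of measurement, here is one way to produce a raw deflate stream with Python's zlib module (the answer doesn't say which tool was used, so this is just an illustration):

```python
import zlib

s = rb"@a@1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs"

# A negative wbits value selects a raw deflate stream: no zlib or gzip
# header and trailer, just the compressed data.
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = co.compress(s) + co.flush()
print(len(s), "bytes in ->", len(raw), "bytes of raw deflate out")
```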

The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it, and there are numbers encoded in it. You could develop a shorter representation of the expected data. You could encode it as a stream of bits, with the first bit indicating that what follows is straight bytes, for the case where the data you got was not what was expected.
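
As a rough sketch of that idea (the field layout below is invented, since the actual message format isn't specified): if a message matches the structure you expect, pack its numeric fields as binary; otherwise fall back to sending the bytes verbatim, with a leading flag playing the role of the "straight bytes" bit described above.

```python
import struct

def encode(msg_fields=None, raw=None):
    if msg_fields is not None:
        # Expected case (hypothetical layout): six float fields packed as
        # little-endian 32-bit floats -> 1 + 6*4 = 25 bytes.
        return b"\x01" + struct.pack("<6f", *msg_fields)
    # Unexpected case: send the original bytes untouched behind a flag byte.
    return b"\x00" + raw

def decode(blob):
    if blob[0] == 1:
        return list(struct.unpack("<6f", blob[1:]))
    return blob[1:]

# Placeholder values; the real fields would come from parsing the message.
print(len(encode(msg_fields=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6])), "bytes for the structured case")
```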

Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can build a dictionary from previous messages to use as a basis for compression, which can be reconstructed at the other end from the previously received messages. That may offer dramatically improved compression if the messages really are similar.
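
zlib exposes exactly this through its preset-dictionary feature. A sketch in Python, with the "previous" messages invented for illustration (both sides must build the identical dictionary, here simply by concatenating earlier messages):

```python
import zlib

# Made-up earlier messages in the same general format.
previous = [
    rb"@a@1\s\215\c\6\-0.51,-0.70,0.30\s\1\x\-3400.-4000,3500.-5000,3800.1700,sbs",
    rb"@a@1\s\215\c\6\-0.52,-0.71,0.31\s\1\x\-3410.-4001,3510.-4999,3840.1720,sbs",
]
zdict = b"".join(previous)   # zlib favors material near the end of the dictionary

msg = rb"@a@1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs"

co = zlib.compressobj(9, zlib.DEFLATED, 15, zdict=zdict)
packed = co.compress(msg) + co.flush()

dco = zlib.decompressobj(zdict=zdict)        # receiver rebuilds the same dictionary
assert dco.decompress(packed) + dco.flush() == msg

print(len(msg), "bytes ->", len(packed), "bytes with a preset dictionary")
```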

Mark Adler
0

You should look up RUN-LENGTH ENCODING. Here is a demonstration:

rrrrrunnnnnn    becomes    5r1u6n    (truncate repetitions: for x consecutive occurrences of r, write xr)

Now, what if some of the characters are digits? Then instead of writing x itself, use the character whose ASCII value is x. For example, if you have 43 consecutive P's, write +P because '+' has ASCII code 43. If you have 49 consecutive y's, write 1y because '1' has ASCII code 49.
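
A minimal sketch of that chr(count)-based scheme in Python (runs longer than 255 characters would need to be split, which is ignored here):

```python
def rle_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(chr(j - i) + s[i])   # e.g. 43 consecutive 'P' -> '+P'
        i = j
    return "".join(out)

def rle_decode(s):
    # The encoded string is a sequence of (count character, repeated character) pairs.
    return "".join(s[k + 1] * ord(s[k]) for k in range(0, len(s), 2))

print(repr(rle_encode("rrrrrunnnnnn")))   # counts 5, 1, 6 come out as raw chr() values
assert rle_decode(rle_encode("rrrrrunnnnnn")) == "rrrrrunnnnnn"
```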

Now the catch: if you have a string with little or no repetition, the encoded result may end up longer than the original. But that's true of all compression algorithms.

NOTE:

I don't encourage using Huffman coding, because even if you use an existing Lempel-Ziv implementation, it's still a lot of work to get it right.

kasavbere