Simple compression algorithm - how to avoid marker-strings?

Question

I was trying to come up with a string compression algorithm for plaintext, e.g.

AAAAAAAABB -> A@8BB

where n symbols y are written out like

y@n

The problem is: what if I need to compress the string "A@8" ? That would confuse the decompression algorithm into thinking that the original input was "AAAAAAAA" instead of just "A@8".

How can I solve this problem? I was thinking of using a "marker" character instead of the @, but what if I wanted the algorithm to work with binary data? There is no marker character that can be used in that case I suppose

You can interleave [counts and non-repeated or repeated characters](https://en.wikipedia.org/wiki/PackBits). (With messages constituted of individual bits, this need a number representation that allows detecting the end of each number.) — greybeard, Nov 10 '15 at 09:21

score 1 · Accepted Answer · edited May 23 '17 at 12:23

1

A simple solution is escaping: you could represent each @ in the source by @@. Everytime you encounter an @, you look one character ahead and find either a number (repeat previous character) or another @ (its literally @).

A variant would be encoding each @ as @@1, which would fit nicely into your current scheme and allows encoding n consecutive @ as @@n.

edited May 23 '17 at 12:23

Community

1
1

answered Nov 10 '15 at 09:06

kubanrob

160
2
8

An alphabet with only two characters (`but what if I wanted the algorithm to work with binary sequences?`) doesn't easily allow for escaping. – greybeard Nov 10 '15 at 09:12
I've read _binary_ as _binary data_ (as opposed to _text_). You're right: with only two character, escaping wouldn't be a good option. – kubanrob Nov 10 '15 at 09:23
Oh sorry I meant "binary data", not really a sequence of 1s and 0s – Dean Nov 10 '15 at 10:22
@greybeard: It's beside the point, but an alphabet with only two characters *can* use escaping -- just use a multi-character escape sequence. E.g. `0` = 0, `10` = 1, `11` = "repeat count follows in some known format" – j_random_hacker Nov 10 '15 at 13:03

Simple compression algorithm - how to avoid marker-strings?

1 Answers1