0

I was trying to come up with a string compression algorithm for plaintext, e.g.

AAAAAAAABB -> A@8BB

where n symbols y are written out like

y@n

The problem is: what if I need to compress the string "A@8" ? That would confuse the decompression algorithm into thinking that the original input was "AAAAAAAA" instead of just "A@8".

How can I solve this problem? I was thinking of using a "marker" character instead of the @, but what if I wanted the algorithm to work with binary data? There is no marker character that can be used in that case I suppose

Dean
  • 6,610
  • 6
  • 40
  • 90
  • You can interleave [counts and non-repeated or repeated characters](https://en.wikipedia.org/wiki/PackBits). (With messages constituted of individual bits, this need a number representation that allows detecting the end of each number.) – greybeard Nov 10 '15 at 09:21

1 Answers1

1

A simple solution is escaping: you could represent each @ in the source by @@. Everytime you encounter an @, you look one character ahead and find either a number (repeat previous character) or another @ (its literally @).

A variant would be encoding each @ as @@1, which would fit nicely into your current scheme and allows encoding n consecutive @ as @@n.

Community
  • 1
  • 1
kubanrob
  • 160
  • 2
  • 8
  • An alphabet with only two characters (`but what if I wanted the algorithm to work with binary sequences?`) doesn't easily allow for escaping. – greybeard Nov 10 '15 at 09:12
  • I've read _binary_ as _binary data_ (as opposed to _text_). You're right: with only two character, escaping wouldn't be a good option. – kubanrob Nov 10 '15 at 09:23
  • Oh sorry I meant "binary data", not really a sequence of 1s and 0s – Dean Nov 10 '15 at 10:22
  • @greybeard: It's beside the point, but an alphabet with only two characters *can* use escaping -- just use a multi-character escape sequence. E.g. `0` = 0, `10` = 1, `11` = "repeat count follows in some known format" – j_random_hacker Nov 10 '15 at 13:03