LZ78 implementation

Question

A quick but sufficient definition of LZ78 is the one from Wikipedia:

Each dictionary entry is of the form dictionary[...] = {index, character}, where index is the index to a previous dictionary entry, and character is appended to the string represented by dictionary[index]. For example, "abc" would be stored (in reverse order) as follows: dictionary[k] = {j, 'c'}, dictionary[j] = {i, 'b'}, dictionary[i] = {0, 'a'}, where an index of 0 specifies the first character of a string. The algorithm initializes last matching index = 0 and next available index = 1.

For each character of the input stream, the dictionary is searched for a match: {last matching index, character}.

If a match is found, then last matching index is set to the index of the matching entry, and nothing is output.

If a match is not found, then a new dictionary entry is created: dictionary[next available index] = {last matching index, character}, and the algorithm outputs last matching index, followed by character, then resets last matching index = 0 and increments next available index. Once the dictionary is full, no more entries are added.

When the end of the input stream is reached, the algorithm outputs last matching index.

The last sentence is a serious problem to me when thinking about implementation. Ok, the stream of output is of the form (index,letter)...(index,letter)(index).

But as any implementation needs to use bytes (or alike that is not really important) in the general case we have a padding. So how to get the decoder not be fooled with the padding?

I know that there exists some tricks, for example if I have the total length of the original string then it is easy to stop the decoder. But, in that case LZ78 is no more a stream compressor. Another example would be to extend the alphabet to have a special char for the terminal case, but this would use at least one more bit for letter encoding which is not acceptable to me. Again, if the character set contains all possible bytes then there is no problem as any step of output generates at least 8 bits (index+letter) so it is easy to know if we are at the end or not.

But in the general case of LZ78 you can have any alphabet. For example if alphabet have only two elements 0 and 1, I can't understand how not to be fooled by the padding. I mean how to distinguish from (index,padding) and (index,letter)?

How to distinguish in between encoding of 00 and 000?

(0,0)(1)+padding (raw: 001+padding)
(0,0)(1,0)+padding (raw: 0010+padding)

Did I missed a very simple point?

Note that papers even the original one from Lempel & Ziv says nothing about this. All implementations I've found and analyzed use one of the tricks I listed (or a variant).

How about entropy coding the indices and symbols? Then you can add an EOF symbol to signal the end of the stream and code it efficiently. Kinda like DEFLATE does it, IIRC. — Dan Mašek, Mar 08 '18 at 16:03
@DanMašek What stupid I am, of course! I was focused on original description that did not talk about entropy encoding of the letters! — Jean-Baptiste Yunès, Mar 08 '18 at 16:29
@DanMašek Thinnking at it again does make me feel very comfortable. In the case of 0/1 alphabet + stop-char one of the letter will be coded with one bit and the other by 2 which will use more bits than in the raw definition. — Jean-Baptiste Yunès, Mar 08 '18 at 19:49
Not necessarily. There are approaches, such as arithmetic coding, which can code close to entropy. Say, you have 500 zeros and 500 ones, followed by 1 EOF. You can code that with approx 1.01 bits/symbol. Now say you have 800 zeros and 200 ones, plus 1 EOF. That one can be coded with approx 0.73 bits/symbol. — Dan Mašek, Mar 09 '18 at 00:03
[Here](https://gist.github.com/dan-masek/2c6e08742bcda5c494483f4e3206b5bf) is a little proof of concept in C++ using Amir Said's FastAC to illustrate the above. I used an adaptive model for symbol frequencies (in the beginning all the probabilities are equal), on randomly shuffled input of 1000 zeros and ones (of several different proportions) terminated by a single EOF. — Dan Mašek, Mar 09 '18 at 00:44
@DanMašek Thinking of it again during the night sleep, made me more definitely comfortable with this, now. Was too busy. Thanks! Please take time to make a little answer please... — Jean-Baptiste Yunès, Mar 09 '18 at 08:58

LZ78 implementation

0 Answers0