A quick but sufficient definition of LZ78 is the one from Wikipedia:
Each dictionary entry is of the form dictionary[...] = {index, character}, where index is the index to a previous dictionary entry, and character is appended to the string represented by dictionary[index]. For example, "abc" would be stored (in reverse order) as follows: dictionary[k] = {j, 'c'}, dictionary[j] = {i, 'b'}, dictionary[i] = {0, 'a'}, where an index of 0 specifies the first character of a string. The algorithm initializes last matching index = 0 and next available index = 1.
For each character of the input stream, the dictionary is searched for a match: {last matching index, character}.
If a match is found, then last matching index is set to the index of the matching entry, and nothing is output.
If a match is not found, then a new dictionary entry is created: dictionary[next available index] = {last matching index, character}, and the algorithm outputs last matching index, followed by character, then resets last matching index = 0 and increments next available index. Once the dictionary is full, no more entries are added.
When the end of the input stream is reached, the algorithm outputs last matching index.
The last sentence is a serious problem to me when thinking about implementation. Ok, the stream of output is of the form (index,letter)...(index,letter)(index).
But as any implementation needs to use bytes (or alike that is not really important) in the general case we have a padding. So how to get the decoder not be fooled with the padding?
I know that there exists some tricks, for example if I have the total length of the original string then it is easy to stop the decoder. But, in that case LZ78 is no more a stream compressor. Another example would be to extend the alphabet to have a special char for the terminal case, but this would use at least one more bit for letter encoding which is not acceptable to me. Again, if the character set contains all possible bytes then there is no problem as any step of output generates at least 8 bits (index+letter) so it is easy to know if we are at the end or not.
But in the general case of LZ78 you can have any alphabet. For example if alphabet have only two elements 0 and 1, I can't understand how not to be fooled by the padding. I mean how to distinguish from (index,padding) and (index,letter)?
How to distinguish in between encoding of 00 and 000?
- (0,0)(1)+padding (raw: 001+padding)
- (0,0)(1,0)+padding (raw: 0010+padding)
Did I missed a very simple point?
Note that papers even the original one from Lempel & Ziv says nothing about this. All implementations I've found and analyzed use one of the tricks I listed (or a variant).