
Can someone explain Lempel-Ziv 76 complexity to me? I was under the impression that you initialize your dictionary with the first letter of the string, then grow a candidate substring one letter at a time, checking at each step whether the candidate occurs in the part of the string before it. When the candidate no longer occurs there, it is called a block, and the next letter starts the next candidate to be searched. (A rough code sketch of this procedure appears after the example below.)

For example,

01011010001101110010

0|1

since 1 is not in 0, we get 0|1|0

since 0 is in 01, we get 0|1|01

since 01 is in 01, we grow to 0|1|011; and since 011 is not in 01, we get 0|1|011|0

since 0 is in 01011, we get 0|1|011|01

since 01 is in 01011, we get 0|1|011|010

since 010 is in 01011, we grow to 0|1|011|0100; and since 0100 is not in 01011, we get 0|1|011|0100|0

and so on, until we get 0|1|011|0100|011011|1001|0,

where the last letter can be a repeat if necessary.
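
In code, my procedure looks roughly like this (just a sketch of my own understanding, not taken from any reference):

    def my_decomposition(s):
        # Grow the candidate one letter at a time; cut a block as soon
        # as the candidate no longer occurs in the history before it.
        # The final block may be a repeat, since we stop growing at the
        # end of the string.
        blocks = []
        i, n = 0, len(s)
        while i < n:
            k = 1
            while i + k < n and s[i:i + k] in s[:i]:
                k += 1
            blocks.append(s[i:i + k])
            i += k
        return blocks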

What am I doing wrong? Because I am being told that for a string 1111111, the decomposition is 1|111111. Thanks!

aznpwnzor

2 Answers


I agree with your decomposition:

01011010001101110010 -> 0.1.011.0100.011011.1001.0

I also believe that what you were told is correct:

1111111 -> 1.111111

However, the way you arrived at your original decomposition is not quite right! Hence the confusion about decomposing 1111111. According to your reasoning, I think you would get:

1111111 -> 1.11.1111

which I'm pretty sure is not correct.

Extending the existing sequence of words (blocks) is not as simple as checking whether the extension previously appears as a substring of the history. It hinges on the concept of reproducibility of an extension that Lempel and Ziv describe in their paper On the Complexity of Finite Sequences (I'm assuming that's what you're working from!). Each extension is the shortest word that is not reproducible from the previous history. The definition of reproducibility they give is rather involved, but it boils down to being able to find a starting position in the previous history from which you can sequentially copy symbols onto the end of the history to form the extension.
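
Here is a minimal Python sketch of that rule as I understand it (the function name is mine; I am expressing "reproducible" as: the extension occurs somewhere within the history extended by all but the last symbol of the candidate, so the copy may start in the history and run into the extension itself):

    def lz76_parse(s):
        # Exhaustive-history parsing in the spirit of Lempel & Ziv (1976):
        # grow the extension s[i:i+k] until it is NOT reproducible from
        # the history s[:i].  Reproducible means s[i:i+k] occurs starting
        # at some position before i, i.e. within s[0:i+k-1].
        words = []
        i, n = 0, len(s)
        while i < n:
            k = 1
            while i + k <= n and s.find(s[i:i + k], 0, i + k - 1) != -1:
                k += 1
            words.append(s[i:i + k])  # the last word may remain reproducible
            i += k
        return words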

From your original sequence, assume the symbols have positions from 1 to 20. The first symbol is always a word/block by itself:

0.

The next extension is not reproducible from the previous history:

0.1.

The next extension is:

0.1.011.

The reason why it can't be 0 or 01 is as follows: 0 is reproducible from the previous history, by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 011 is not reproducible.

The next extension is:

0.1.011.0100.

0 is reproducible by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 010 is reproducible by copying three symbols from position 1 to the end; 0100 is not reproducible.

And so on.
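
Running the sketch above on your full sequence reproduces exactly this parsing:

    >>> lz76_parse("01011010001101110010")
    ['0', '1', '011', '0100', '011011', '1001', '0']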

The decomposition of 1111111 is as follows: the first symbol is a block by itself:

1.

The next extension is:

1.111111

1 is reproducible by copying one symbol from position 1 to the end of the previous history. 11 is reproducible by copying two symbols from position 1 to the end. This is where it gets complicated - in this case, when you copy two symbols, the source of the copy actually extends past the end of the previous history! In other words, the second 1 that you copy is actually part of the extension that results from copying the first 1 onto the end. Similarly, 111, 1111, 11111, 111111 are all reproducible, due to this recursive copying process. However, since we have now reached the end of the original sequence, the last extension is deemed to be 111111.
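
The copy running past the end of the history is perhaps easiest to see in code. A small sketch (the function name is mine, and it uses 0-based positions, whereas I used 1-based positions above):

    def reproduce(history, start, k):
        # Copy k symbols one at a time, beginning at index `start`;
        # the copy may read symbols it has itself just appended.
        out = list(history)
        for _ in range(k):
            out.append(out[start])
            start += 1
        return "".join(out[len(history):])

For example, reproduce("1", 0, 6) returns "111111": every copied 1 after the first is a symbol the copy itself has just written.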

Hopefully my explanation has made some sense.

qootee

This paper does not agree with your description of the algorithm. According to the paper, you have a new partition if it doesn't match any previous partition. You don't get to make partitions based on the entire unpartitioned preceding sequence, as you have done. So for your examples (I am using . instead of | to partition, since that's easier to read):

01011010001101110010

partitions into:

    0.1.01.10.100.011.0111.00.10

so the LZ76 weight is 9 (not 7).

1111111

partitions into:

1.11.111.1

Both examples illustrate the case where the final partition is contained in a previous one. Hence the < r instead of <= r in the definition.
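
In code, my reading of that definition is something like the following sketch (the function name is mine; the rule is to grow the current word until it differs from every previous word, and to let the final word repeat if the sequence ends first):

    def lz76_distinct_parse(s):
        # Each completed word must differ from every previous word.
        # The final word may repeat an earlier one (the "< r rather
        # than <= r" case), because the sequence simply runs out.
        words, seen = [], set()
        i, n = 0, len(s)
        while i < n:
            k = 1
            while i + k < n and s[i:i + k] in seen:
                k += 1
            word = s[i:i + k]
            words.append(word)
            seen.add(word)
            i += k
        return words

This matches both partitions above:

    >>> lz76_distinct_parse("01011010001101110010")
    ['0', '1', '01', '10', '100', '011', '0111', '00', '10']
    >>> lz76_distinct_parse("1111111")
    ['1', '11', '111', '1']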

I don't have the original paper, so I can't check whether this paper got it wrong or not. However, I doubt that they incorrectly copied the definition from the original paper.

Mark Adler