Ukkonen suffix tree: procedure 'canonize' unclear

Question

How does the 'canonize' function (given below, from Ukkonen's paper) work, and in particular, when is the while loop finished? I think the value of p' - k' will always remain less than that of p - k. Am I right or wrong?

procedure canonize(s, (k, p)):
1. if p < k then return (s, k)
2. else
3. find the tk–transition g'(s, (k', p')) = s' from s;
4. while p' − k' <= p − k do
5.     k  = k + p' − k' + 1;
6.     s  = s';
7.     if k <= p then find the tk–transition g'(s, (k', p')) = s' from s;
8. return (s, k).

I think it would be good to add the exact reference to the paper. Is this the 1995 paper that appeared in _Algorithmica_? In that case, the reference would be DOI 10.1007/BF01206331. — jogojapan, Apr 11 '12 at 02:04
The paper can be located at http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf — Abhishek, Apr 11 '12 at 10:28

score 8 · Accepted Answer · edited May 23 '17 at 11:48

What the canonize function does is what is described at the very end of this SA post, where we consider a situation like this:

enter image description here

The situation is this:

The active point is at (red,'d',3), i.e. three characters into the defg edge going out of the red node.
Now we follow a suffix link to the green node. In theory, our active node is now (green,'d',3).
Unfortunately that point does not exist, because the de edge going out of the green node has got only 2 characters. Hence, we apply the canonize function.

It works like this:

The starting character of the edge we are interested in is d. This character is referred to as t_k in Ukkonen's notation. So, "finding the t_k-edge" means finding the de edge at the green node.
That edge is only two characters in length. I.e. (p' - k') == 2 in Ukkonen's notation. But the original edge had three characters: (p - k) == 3. So <= is true and we enter the loop.
We shorten the edge we are looking for from def to f. This is what the p := p + (k' - p') + 1 step does.
We advance to the state the de edge points to, i.e. the blue state. That is what s := s' does.
Since the remaining part f of the edge is not empty (k <= p), we identify the relevant outgoing edge (that is the fg edge going out of the blue node). This step sets k' and p' to entirely new values, because they now refer to the string fg, and its length (p' - k') will now be 2.
The length of the remaining edge f, (p - k), is now 1, and the length of the candidate edge fg for the new active point, (p' - k'), is 2. Hence the loop condition

while (p' - k') <= (p - k) do

is no longer true, so the loop ends, and indeed the new (and correct) active point is (blue,'f',1).

[Actually, in Ukkonen's notation, the end pointer p of an edge points to the position of the final character of the edge, not the position that follows it. Hence, strictly speaking, (p - k) is 0, not 1, and (p' - k') is 1, not 2. But what matters is not the absolute value of the length, but the relative comparison of the two different lengths.]

A few final notes:

Pointers like p and k refer to positions in the original input text t. That can be quite confusing. For example, the pointers used in the de edge at the green node will refer to some substring de of t, and the pointers used in the fg edge at the blue node will refer to some substring fg of t. Although the string defg must appear as one continuous string somewhere in t, the substring fg might appear in other places, too. So, the pointer k of the fg edge is not necessarily the end pointer p of the de edge plus one.
What counts when we decide whether or not to end the loop is therefore not the absolute positions k or p, but the length (p - k) of the remaining edge compared to the length (p' - k') of the current candidate edge.
In your question, line 4 of the code snippet, there is a typo: It should be k' instead of k;.

Thanx for the answer, it cleared every aspect of canonize. It will be very helpful to me and all other peoples who are having problem with understanding suffix tree if you could provide an explanation lke this for update, test-and-split and specially Algorithm 2 's last line which calls canonize after calling update............Once again cheers for you and thanx....... — Abhishek, Apr 12 '12 at 02:29
I agree, it would indeed be a good idea to put the original names of these various functions in the comprehensive SA post of the algorithm I mentioned above. I shall do that some time soon. But if further questions remain then (which is very possible), we should treat those in separate SA posts. — jogojapan, Apr 12 '12 at 06:28

score 0 · Answer 2 · answered Oct 17 '14 at 13:04

0

I had to change the canonize function because it didn't handle the auxiliary state properly. I added the following code after the 'p < k' if:

if (s == auxiliary)
{
    s = root;
    k++;
    if (p < k)
        return;
}

It seems to work now :)

answered Oct 17 '14 at 13:04

Kris De Greve

1

Ukkonen suffix tree: procedure 'canonize' unclear

2 Answers2

Linked