3

ukkonen's on line construction algorithm

i got a problem trying to understand the 'test and split' procedure,which is as follows: procedure test–and–split(s, (k, p), t):

>1. if k ≤ p then
>2. let g'(s,(k',p'))=s' be the tk-transition from s
>3. if t=t(k'+p-k+1) then return (true,s)

my problem is that what exactly does the 2nd line mean,how can g'(s,(k',p'))be still a tk-transition if it starts from s and followed by t(k') instead of t(k)??

1 Answers1

1

Probably you already figured it out and you don't need an answer anymore, but since I had the same problem in trying to understand it, and maybe it'll be useful for someone else in the future, the answer I think is the following one.

In Ukkonen's on line construction algorithm, on page 7 you can read that:

...
The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as generalized transition g′(s,w) = r. To save space the string w is actually represented as a pair (k,p) of pointers (the left pointer k and the right pointer p) to T such that tk . . . tp = w. In this way the generalized transition gets form g′(s, (k, p)) = r.
Such pointers exist because there must be a suffix Ti such that the transition path for Ti in STrie(T) goes through s and r. We could select the smallest such i, and let k and p point to the substring of this Ti that is spelled out by the transition path from s to r. A transition g′(s, (k, p)) = r is called an a–transition if tk = a. Each s can have at most one a–transition for each a ∈ Σ.
...

This means that we are looking for the smallest indexes k and p such that tk . . . tp = w in T
=> if there is more than one occurrence of w in T, with k and p we always reference the first one.

Now, procedure test–and–split(s,(k,p),t) tests whether or not a state with canonical reference pair (s,(k,p)) is the endpoint, that is, a state that in STrie(T i−1) would have a ti –transition. Symbol ti is given as input parameter t.

The first lines of the algorithm are the following:

   procedure test–and–split(s,(k,p),t):
1.   if k ≤ p then
2.     let g′(s,(k′,p′)) = s′ be the t(k)–transition from s;
3.     if t = t(k′+p−k+1) then return(true,s)
4.     else ...

On line 1 we check if the state is implicit (that is when k <= p).

If so, then on line 2 we want to find the transition from s that starts with the character we find in pos k of T (that is tk). Note that tk must be equal to tk' but indexes k and k' can be different because we always point to the first occurrence of a string w in T (remember also that from one state there can be at most one transition that starts with character tk => so that's the correct and the only one).

Then on line 3 we check if the state referenced by the canonical reference pair (s,(k,p)) is the endpoint, that is if it has a ti -transition. The state (s,(k,p)) is the one (implicit or not) that we can reach from state s, following the tk' -transition (that is the tk-transition because k' = k) for (p - k) characters. This explains the tk′+p−k+1, where the +1 is for the next character, the one that we are checking if it is equal to t (where t = ti). In that case we reached the endpoint and we return true.

Else, starting from line 4, we split the transition g′(s,(k′,p′)) = s′ to make explicit the state (s,(k,p)) and return the new explicit state.

Tommy
  • 208
  • 2
  • 8
  • I was stuck at exactly the same spot in the paper and I’m glad that I found your answer! – pterodragon Dec 09 '18 at 09:25
  • 1
    @pterodragon From state *s* we want to read the string *w=t(k)...t(p)* represented by the pair of pointers *(k,p)*. From *s* there can be only one *t(k)*-transition (meaning a transition that starts with character *t(k)*), but if in *T* there is another string *w'=t(k')...t(p')* such that *w'=w* and *k' – Tommy Dec 10 '18 at 12:15
  • It seems like in the paper, the "cacao" string doesn't make k' != k for any s(j) in the construction process. Can you give a concrete example? – pterodragon Dec 11 '18 at 01:17
  • 1
    Actually, I tried to build the suffix tree for "cacao" with my python implementation of the algorithm (which you can find [here](https://github.com/tommy91/Algorithms/blob/master/suffixTree.py)) and it uses k'!=k in 3 calls of the function test-and-split. As you can see on my code, I inserted lines 105-107 to notify the use of k'!=k. – Tommy Dec 11 '18 at 21:25
  • There seems to be something wrong about "looking for the smallest indexes k and p such that t_k . . . t_p = w in T" in your answer. For example for the string "aabaac", the final STree has a transition path for 'a' not from the root but from r = (root, "a"). That transition is g'(r, (2, 2)) = r' rather than g'(r, (1, 1)) (Following 1-indexed notation in the paper. In your code it's 0-indexed) – pterodragon Feb 20 '19 at 10:06
  • In the paper it says "There must be a suffix T_i such that the transition path for T_i in STrie(T) goes through s and r. Select the smallest such i". For the string "aabaac", the transition g'(r, (2, 2)) = r' I mentioned above derives from the suffix "aabaac" (i = 1) as the transition path for "aabaac" in STrie(T) goes through r and r'. And the substring corresponding to that transition path is starting from the second 'a', which is "a" of a"a"baac. So, k is 2 rather than 1. Please correct me if I'm wrong. – pterodragon Feb 20 '19 at 10:46
  • Closing this answer, I just found [this brilliant explanation](https://stackoverflow.com/a/9513423/6437176). If you read it, at the Step 4 of "First extension: Simple repetitions" you understand why k' can be different from k, answering the original question. Simply, to build the tree we move to the right of our word one letter at the time. To extend the tree with the next considered char (say A), at the current node (implicit or explicit) we look for a transition starting with that char (A). But if that transition exists it refers to a previous occurrence of that char (A), so k' < k. – Tommy Feb 21 '19 at 15:59