3

Question

Gusfield describes Ukkonen's algorithm for the construction of suffix trees here, but I'm having difficulty synthesizing his account with the demonstration given in jogojapan's SO answer and, as a result, struggling with deriving a working implementation from Gusfield's description.

Particularly, I'm having trouble integrating the active point implementation described by jogojapan with Gusfield's description of the suffix extension rules of Ukkonen's algorithm.

In short, my question is: does B (beta) in Gusfield's definitions of the suffix extension rules correspond to the active point in jogojapan's answer?

Context

As far as I can tell, the active point targets the location of the next node insertion in the tree, which corresponds to the insertion of a suffix insofar as it creates a new branch that ends in a leaf.

The active point appears to come into play after the application of what Gusfield describes as suffix extension Rule 1 to every existing leaf using the global index described in his "Trick 3". So far, this seems consistent with jogojapan's answer.

However, Gusfield's text begins to escape me when it is time to apply either Rule 2 or Rule 3 after the application of Rule 1. Gusfield outlines the "high-level algorithm" as follows:

Construct tree I
For *i* from 1 to *m* - 1 do:
Begin {phase *i* + 1}
  For *j* from 1 to *i* + 1
    Begin {extension *j*}
    Find the end of the path from the root labeled S[*j*..*i*] in the current tree and, if needed, extend that path by adding character S(*i* + 1), thus assuming that string S[*j*..*i*+1] is in the tree.
    End;
End;

From this account, I'm left wondering what happens in extention j = i + 1. As stated, the for-loop runs up until this value of j, but the body of the loop only specifies behavior up to j = *i. In other words, it appears that continuing with the algorithm for the final extension will result in some sort of buffer overflow or out-of-bounds error.

Gusfield's definitions of the suffix extension rules appear to suffer from the same problem. Specifically, he writes (B substituted for Greek beta):

Let S[j..i] = B be a suffix of S[1..i]. In extension j, when the >algorithm finds the end of B in the current tree, it extends B to be sure >the suffix BS(i + 1) is in the tree.

Rule 1 In the current tree, path B ends at a leaf. That is, the path >from the root labeled B extends to the end of some leaf edge. To update the >tree, character S(i + 1) is added to the end of the label on that edge.

Rule 2 No path from the end of string B starts with character S(i + 1), >but at least one labeled path continues from the end B. In this case, a new leaf edge starting from the end of B mus be created >and labeled with character S(i + 1). A new node will also have to be >created there if B ends inside an edge. The lead at the end of the new edge >is given the number j.

Rule 3 Some path from the end of string B starts with character S(i + >1). In this case, the string BS(i + 1) is already in the current tree, so >(remembering that in an implicit suffix tree the end of a suffix need not >be explicitly marked) we do nothing.

Again, it appears that extension j = i + 1 of phase i + 1 is omitted. This creates problems for the application of the rules as written, since they all rely on evaluating B, which cannot, by its definition, include the final extension without creating an out-of-bounds error: i.e. if j = i + 1 and B = S[j..i], then B = S[i + 1..i], which is impossible.

I've tried assuming B = "" (empty string) in this case, with the matching node being the root node, but this produces an insertion point that conflicts with that specified by the active point. The reason for this appears to be that, given the rules as stated and my assumption about the apparently neglected case of j = i + 1, B can only ever point to the root node in any live implementation, assuming that the specified "tricks" are followed.

My reasoning for this is as follows:

  1. By definition, there are n suffixes in a suffix tree for a string of length n (according to Wikipedia). Ukkonen's algorithm contains n phases, one per character in the input string. Thus, one suffix must be added per phase (though the rules and tricks allow additions to be deferred to later phases).

  2. There are i extensions performed for each phase i from 1 to n (i.e. the sum of i from 1 through n). Thus there are (substantially) more extensions than there are phases and thus suffixes. It is clear from this that suffix extension does NOT equal suffix insertion.

  3. No suffixes may be inserted by applying Rule 1, since this rule only updates existing leaf edges with the terminal character tracked by the global index.

  4. Thus, by 1. and 3. it follows that all suffix insertions are performed according to Rules 2 or 3. But only one suffix insertion can occur per phase, though there are i extensions, so only one extension per phase can be an insertion. As a result, the first i extensions of a phase i + 1 occur according to Rule 1.

  5. Now, if the first i extensions of phase i + 1 are performed by Rule 1, thus exhausting all cases of B = [j..i], only one extension is left to be performed by Rule 2 or Rule 3. However, the decision between which of these to apply is decided by checking the path from the end of B. But, as stated above, this is impossible, leaving the end of B in this case to be assumed as the root. The consequence of this is that all insertions from B must be from root, but it is explicitly stated in Rule 2 that nodes can be inserted from within edges.

If this is correct, then there is a direct conflict between the insertion points specified by Gusfield's rules and those specified by the active point implementation outlined by jogojapan, such that the former results in potentially many suffixes being left off of the tree.

Is my reasoning faulty or am I missing something in the text? Is it possible to synthesize these approaches?

Note: I haven't read the Gusfield book itself and have only referenced the pdf cited above, so it may be that I'm using an obsolete version of the text. In any case, I tried checking the errata, but found nothing pertaining to this chapter.

Also, I used this visualization tool to test some theses quickly to rule out certain possibilities.

Jon Badiali
  • 166
  • 6

0 Answers0