28

Could anyone give me an example about how and when to create a suffix link in suffix tree?

If my string is ABABABC, but do use a different example if that is better.

Hope to give some pictures to illustrate every step.

very appreciate.

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
lingguang1997
  • 451
  • 1
  • 4
  • 12

1 Answers1

68

To understand this, first recall that there are three kinds of nodes in a suffix tree:

  • The root
  • Internal nodes
  • Leaf nodes

In the graph below, which is the suffix tree for ABABABC, the yellow circle is the root, the grey, blue and green ones are internal nodes, and the small black ones are leaves.

There are two important things to notice:

  • Internal nodes always have more than 1 outgoing edge. That is, internal nodes mark those parts of the tree where branching occurs.

  • Branching occurs wherever a repeated string is involved, and only there. For any internal node X, the string leading from the root to X must have occurred in the input string at least as many times as there are outgoing edges from X.

Example: The string leading to the blue node is ABAB. Indeed, this string appears twice in the input string: At position 0 and at position 2. And that is why the blue node exists.

Now about suffix links:

  1. If the string s leading up to some internal node X is longer than 1 character, the same string minus the first character (call this s-1) must be in the tree, too (it's a suffix tree, after all, so the suffix of any of its strings must be in the tree, too).

    Example: Let s=ABAB, the string leading to the blue node. Then after removing the first character, s-1 is BAB. And indeed that string is found in the tree, too. It leads to the green node.

  2. If some string s leads to an internal node, its shortened version s-1 must lead to an internal node (call it X-1) as well. Why? Because s must appear at least twice in the input string, so s-1 must appear at least as many times (because it is part of s: wherever s appears, s-1 must appear, too). But if s-1 appears multiple times in the input string, then there must be an internal node for it.

  3. In any such situation, a special link connecting X to X-1 is a suffix link.

Note: Because of (1) and (2) above, every internal node X that has a label from root to X of more than 1 character must have a suffix link to exactly one other internal node.

Example:

This is the same suffix tree as before; the dotted lines indicate the suffix links. If you start at the blue node and follow the suffix links from there (from blue, to green, to first gray, to second gray), and look at the strings leading from the root to each node, you will see this:

 ABAB   ->    BAB    ->    AB    ->    B
(blue)      (green)     (gray1)     (gray2)

This is why they are called suffix links (the entire sequence is called suffix chain).

What are they good for?

They are good for surprisingly many things. However, they play a particular role in Ukkonen's algorithm for suffix tree construction, specifically in Rule 3 described there: After inserting a the final character of some suffix s at some point X, the algorithm needs to insert the final character of suffix s-1 in O(1) time. In order to do that, it uses the suffix link to jump right to the place X-1 and makes the insert.

But, note that there is no necessity to put suffix links in a suffix tree. They are not part of the definition of a suffix tree — they are just special links used by some algorithms that construct or use suffix trees.


Regarding single-character nodes: What if there is an internal node X whose string (i.e. the string on the path from root to X) consists of only one character? By the definition above, X then does not have a suffix link. You can however assume that if it had a suffix link, it would point to the root node. Furthermore: If, by the definition above, an internal node does not have a suffix link, it must be a single-character node, so you can always assume that if no suffix link is present at an internal node it must be a single-character node, and therefore, the node that represents the s-1 suffix is the root node. (Some algorithms may actually add an explicit suffix link pointing to the root node in this case.) Thanks to j_random_hacker for the comment about this.

Community
  • 1
  • 1
jogojapan
  • 68,383
  • 11
  • 101
  • 131
  • Thanks for your quick answer. I understand the concept of suffix tree. Hope to get a concrete implementation of it. – lingguang1997 Apr 18 '12 at 17:50
  • 1
    "For any internal node X, the string leading from the root to X must have occurred in the input string as many times as there are outgoing edges from X." The substring AB occurs 3 times in ABABABC, but there are only 2 outgoing edges from the grey dot whose path is labelled AB. – librik May 19 '12 at 03:12
  • 2
    @librik Thank you. I've inserted the words 'at least'. I believe that makes it a correct (and still useful) statement. – jogojapan May 22 '12 at 00:48
  • 4
    Fantastically clear and gentle introduction. I would add only that suffix links also exist from nodes labelled with a single character, back to the root. – j_random_hacker Aug 08 '13 at 16:38
  • 1
    @j_random_hacker Yes, that's right. The reason I didn't include them in the description is that they need not be added to the tree explicitly. If suffix links are added to the tree systematically during tree construction (as is the case in Ukkonen's algorithm), you can simply assume that _any_ internal node that does not have an outgoing suffix link, is a single-character node and its suffix link therefore must lead to the root node. But it's true that at least _conceptually_ these suffix links still exist. – jogojapan Aug 10 '13 at 01:27
  • Thanks for this - it's a great answer about *What* are suffix links. However, the OP is asking *How* and *When* do we create suffix links (I believe that he's referring to the context of Ukkonen's algorithm) - and you haven't answered that... – ihadanny Nov 12 '16 at 08:23
  • @ihadanny Have you looked at the "What are they good for?" section? – jogojapan Nov 12 '16 at 09:14
  • 1
    Hey @jogojapan - first, I'd like to thank you personally for this answer and the great answer about Ukkonen in plain English. However, I still think that the "What are they good for" section describes **how are they used, assuming they already exist**. The OP would like to know is **when do we create a suffix link**. I guess this is trivial for you that once adding an internal node you immediately can add a suffix link to the previously created internal node, and that covers all the suffix links we need to create - but it's not really intuitive for me... – ihadanny Nov 12 '16 at 09:39
  • You are the legend! – Kaushal28 May 04 '17 at 12:43
  • @jogojapan Can you elaborate on why one cannot simply replace the logic of suffix links with keeping pointers to all nodes in the form of a hash table? e.g. when done with node: ABC look for $p[BC]$ where $p$ is the hash table with key=suffix and value=pointer to node for that suffix. – Anon Mar 13 '23 at 12:31
  • @Anon You can do that, but don't forget that for each hash lookup a hash value has to be computed for the string you are looking up. If that string is N characters in length, computing the hash function will be O(N) time. (Hashes are often considered O(1) lookup, but that is because in discussions about hash complexity, the time for computing the hash value is usually ignored because the point in those discussions is about the retrieval time compared to, say, trees, and then the length of the key is negligible.) – jogojapan Mar 16 '23 at 21:34