14

The longest common substring problem according to wiki can be solved using a suffix tree.
From wiki:

The longest common substrings of a set of strings can be found by building a generalised suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it

I don't get this.
Example: if I have:
ABCDE and XABCZ
then the suffix tree is (some branches from XABCZ omitted due to space):
enter image description here

The longest common substring is ABC but it is not I can not see how the description of wiki helps here.
ABC is not the deepest internal nodes with leaf nodes.
Any help to understand how this works?

Cratylus
  • 52,998
  • 69
  • 209
  • 339
  • `ABC is not the deepest internal nodes with leaf nodes.` No, but ABC *is* the longest *common* string of nodes anywhere in the tree. The next longest ones are `B-C` and `D-E`, with two nodes each. – Robert Harvey Jun 12 '12 at 20:15
  • Yes `ABC` is the longest common string. But I don't understand how the wiki description would actually help me find it programmatically – Cratylus Jun 12 '12 at 20:20
  • 4
    You have to read the other Wiki: http://en.wikipedia.org/wiki/Generalised_suffix_tree. There are probably some better (more easily understandable) resources [here](https://www.google.com/search?q=generalized+suffix+tree). See also http://stackoverflow.com/questions/969448/generalized-suffix-tree-java-implementation – Robert Harvey Jun 12 '12 at 20:22
  • @user384706 - I think part of the problem is that's not a proper suffix tree. You should only one branch that starts root-A-B-C, and the C would have two children: D-E and Z. Similar for the rest of the branches. What you have there is basically just a list of suffixes that all have a root node pointing at them. – twalberg Jun 12 '12 at 20:36
  • @twalberg:Yes, you are right. The 2 blue branches would be one. But in this case how could I programmatically find them? It is not clear to me how the wiki description helps – Cratylus Jun 12 '12 at 20:46
  • @user384706 - The function that properly creates the suffix tree from the set of input strings would find them. You essentially start with an empty tree (note, not a binary tree, as each node may have many children) and insert each suffix into the tree once. There's some other bookkeeping involved as well, but that's the core concept. – twalberg Jun 12 '12 at 20:52
  • @twalberg:After the tree is build, it must be somehow traversed in order to find the nodes that represent the longest substring.How are these actually located is what I am missing here – Cratylus Jun 12 '12 at 20:55
  • @user384706 Once you have the tree built, you can walk the tree in any way you want (in order, pre order, post order, etc.), and keep track of the deepest node you find that has at least one child belonging to a suffix of every input string - of course that implies that every node has some information pointing to which suffix of which string caused this node to be created - that's the extra bookkeeping I was referring to. The wiki has links to a couple well known algorithms. – twalberg Jun 12 '12 at 20:59
  • @twalberg:Ok, but how does this traversal fit-in with the "recipe" described by wiki in my OP quote? – Cratylus Jun 12 '12 at 21:03
  • @user384706 Maybe I don't understand what you're asking, but I don't know how to make the "recipe" any clearer than that - build a generalized suffix tree (using Ukkonen's algorithm or McCreight's algorithm or some variant), then walk through the tree nodes to find the deepest one(s) having children that originated in all of the original strings. – twalberg Jun 12 '12 at 21:18

2 Answers2

8

Like what's been said before, your tree is incorrect.

This is what I get when running "ABCDE$XABCZ" through my code.

Suffix Tree code:

String = ABCDE$XABCZ$
End of word character 1 = $
└── (0)
    ├── (20) $
    ├── (22) ABC
    │   ├── (15) DE$
    │   └── (23) Z$
    ├── (24) BC
    │   ├── (16) DE$
    │   └── (25) Z$
    ├── (26) C
    │   ├── (17) DE$
    │   └── (27) Z$
    ├── (18) DE$
    ├── (19) E$
    ├── (21) XABCZ$
    └── (28) Z$

In a (compact) suffix tree, you need to find the deepest internal node(s) which have leaf nodes from all the strings. If you have multiple nodes at the same depth, you have to compare the length of the string represented by that node. i.e. ABC, BC, and C all have the same depth, so you have to compare the length of ABC, BC, and C strings to see which is longer; which is ABC obviously.

Suffix Trie code:

└── null
    ├── A
    │   └── B
    │       └── C
    │           ├── D
    │           │   └── (E) ABCDE
    │           └── (Z) ABCZ
    ├── B
    │   └── C
    │       ├── D
    │       │   └── (E) BCDE
    │       └── (Z) BCZ
    ├── C
    │   ├── D
    │   │   └── (E) CDE
    │   └── (Z) CZ
    ├── D
    │   └── (E) DE
    ├── (E) E
    ├── X
    │   └── A
    │       └── B
    │           └── C
    │               └── (Z) XABCZ
    └── (Z) Z

In a (not compact) suffix trie, find the deepest internal node(s) which have leaf nodes from all the strings.

Hope it helps.

Justin
  • 4,196
  • 4
  • 24
  • 48
  • 1
    Is this a "collapsed" suffix tree? Also what are the numbers e.g. `(20)` or `(24)` etc? – Cratylus Jun 12 '12 at 21:31
  • Yes, this is collapsed, but collapsing means merging e.g. "DE" into a single node. All suffix trees, however, would combine "CDE$" and "CZ$" into one "C" node with a "DE$" branch and a "Z$" branch -- but a collapsed suffix tree would treat "DE$" as one node, and an uncollapsed suffix tree would treat "DE$" as three nodes, one for each character. – Louis Wasserman Jun 12 '12 at 21:45
  • @user384706 Yea, it's a collapsed tree. The numbers are just node identifiers, nothing to note. I've also created a suffix trie, if you are interested. I added it to the answer. – Justin Jun 12 '12 at 21:46
4

You have not actually drawn the suffix tree. Had you done it properly, at the root you would only have every possible character once. The tree only splits when a single letter can have multiple following suffixes. That forces common substrings together in the tree, which makes them findable.

btilly
  • 43,296
  • 3
  • 59
  • 88
  • I am not sure I follow you here. If the string was: `ABCDBEF` then there would be a subranch under `B` for `BCDBEF` and `BEF` but for this example i.e. `ABCDE` won't we have a branch for each possible suffix? – Cratylus Jun 12 '12 at 20:40
  • In the example diagram you gave, there should be at most one child of the root labeled 'A'. You should have merged those two nodes. – Louis Wasserman Jun 12 '12 at 21:37
  • @LouisWasserman:Really?Why?`A` is a prefix in `ABCDE`. Why would there be a child node of root that is `A`? – Cratylus Jun 12 '12 at 21:39
  • The entire point of a suffix tree is to merge matching prefixes of suffixes into the same roots. There should be one branch off the root that goes `ABC`, and then that last node should have two children, one `X` and one `DE`. – Louis Wasserman Jun 12 '12 at 21:41
  • @LouisWasserman:Ok. I agree.But in this case, i.e. having the 2 merged nodes, how can I programmatically find the longest common substring using the wiki description?I am stuck on this – Cratylus Jun 12 '12 at 21:44
  • If you look at the properly drawn tree, you'll see that "ABC" is, well, the deepest internal node with children coming from each of the original strings. – Louis Wasserman Jun 12 '12 at 21:46