how to find the indexes of all matching substring using suffix tree?

Question

I created a suffix tree from this amazing answer. It works like a charm!

For now, if I look for "cat" in "This cat is a pretty cat", it will return 5 as "cat" first appearance as for starting index 5.

But I can't find a way to keep track of all the suffixes in the algorithm to create. So basically, I can find the index of the first match, but not all the different occurrences.

For now, I have:

class Edge
{
    int textIndexFrom;
    Node* nodefrom;
    Node* nodeTo;
    int textIndexTo;
}

class Node
{
    std::map<char,Edge*> m_childEdges;
    Edge* m_pParentEdge;
    Node* m_pLinkedNode;
}

I just put the relevant variables in the code above. To store the different starting positions, I imagine a std::vector is needed in Edge, but I don't see when to add a new index. We might use the active point but with the suffix links, it becomes tricky.

Could someone explain?

score 1 · Accepted Answer · answered Jun 11 '14 at 20:20

I assume you constructed a suffix tree for the string S$ where $ is some special character not present in S. The $ char ensures that each suffix has its own leaf in the tree. The number of occurances of word w in S is the number of leaves in the subtree of w.

I think that storing all starting positions in each edge/node would require quadratic memory. For example if T happens to be perfectly balanced binary tree then on level 0 (root) you have n entries, on level 1 you have 2 * n/2 entries and so on. After summing it gives us n^2. It requires proving so please correct me if I'm wrong.

Instead I think its better to keep all the leaves in a list in order they appear in dfs traversal of the tree (left to right if you draw a picture of the tree). And in every node v keep 2 pointers to the elements of that list. First should point to the first leaf of v's subtree and second to the last leaf of v's subtree. All that can be computed by simple dfs procedure. Now if for example 'cat' happens to be in the middle of some edge then go through that edge to some node v and get leaves from that node. In addition in every leaf you should store the length of the path from root to that leaf. It will help you find the position of that particular 'cat' occurance.

Yeah, I didn't think about putting a unique char at the end to have all the possible cases. For the storage space, it's what I thought too but I was a bit clueless on how to do it. — dyesdyes, Jun 11 '14 at 20:26
I don't actually think you need all of the things you said. If you go down the tree with the string to search and end up at the end of an edge which is not a leaf, then you have multiple matches and you then just need to take the starting index of each edge starting to the next node to compute the right indexes. — dyesdyes, Jun 11 '14 at 20:48
First identify all the leaves. Assume N is the length of whole string, L is length of the suffix corresponding to some leaf l. Position in S is N - L. — piotrekg2, Jun 11 '14 at 20:52

score 0 · Answer 2 · answered Jun 11 '14 at 20:10

Walk the entire cat subtree. Each leaf in that subtree corresponds to a suffix that begins with cat. If you know the length of the string you've matched so far and the length of the string, each time you encounter a leaf you can do a subtraction to find the index of the corresponding occurrence of cat.

how to find the indexes of all matching substring using suffix tree?

2 Answers2