Extracting all occurrences of repeated and unique patterns from text, along with context

Question

Say I have the text "abcabx". I would like to know that there is a repeated pattern "ab", all the locations where it appears, and how the context of each repetition relates to its other occurrences. I also want the data structure to distinguish and isolate the unique patterns "c" and "x". I have set up a suffix tree in an attempt to do so, and it looks like this (from this SO answer):

[Image: suffix tree for "abcabx"]

This does indeed tell me that the pattern "ab" appears twice, once with the suffix "cabx" and once with "x". However, the "ab" at the root only points to the first occurrence of the pattern. There is also another "ab" embedded in its leaf "cabx", and I'd want that "ab" (in "cabx") to somehow be acknowledged as a repeat in the data structure. I know that the "x" leaf of the root "ab" represents it, but I need to know, within the "cabx" leaf of "ab", that there is an "ab" in there. I also need to know that two unique patterns, "c" and "x", are part of that edge, along with their locations in that edge and a cross-reference to their "main definitions" (root edges?). It seems that such things could be figured out by iterating around the tree and putting it together, but I need a data structure that stores this information outright.

To put it more simply, the data structure needs to clearly say "here are all the unique patterns", "here are all the repeated patterns and every place they occur", and "here is the context that relates all of these things".

So I guess I'm looking for a graph-like element to the suffix tree, something that will partition out known patterns and relate them explicitly. In the process, patterns that are unique would be noted. But I still want the contextual features of the suffix tree, such as saying both "c" (not "cabx", but "c") and "x" came after "ab", "abx" came after "abc", what came after them (in larger cases), etc. Is there an adaptation of the suffix tree that does this, or perhaps another algorithm?

Have you looked at the _enhanced suffix array_ ([DOI 10.1016/S1570-8667(03)00065-0](http://dx.doi.org/10.1016/S1570-8667(03)00065-0))? I am not 100% sure that it matches your needs perfectly, but: (a) being a _suffix array_, it makes it quite easy to get to _all_ contexts of a substring, (b) being _enhanced_, it includes data structures that emulate a suffix tree, i.e. you can get to (an emulated version of) child nodes etc., (c) it can be searched efficiently, (d) it can be stored in compact/succinct/compressed forms, much more effectively than trees. — jogojapan, Sep 10 '12 at 03:04
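As a rough illustration of point (a) in the comment above, here is a minimal sketch of a plain suffix array with an LCP array. It is not the full enhanced suffix array from the cited paper (which adds further tables to emulate tree navigation), and it is built naively rather than with a linear-time algorithm. The function names `suffix_array`, `lcp_array` and `repeated_substrings` are made up for this example.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    def common(i, j):
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def repeated_substrings(text):
    """Map every substring that occurs at least twice to its sorted start positions."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    repeats = {}
    for k in range(1, len(sa)):
        # Every prefix of the shared prefix is a substring seen at both positions.
        for length in range(1, lcp[k] + 1):
            sub = text[sa[k]:sa[k] + length]
            repeats.setdefault(sub, set()).update((sa[k - 1], sa[k]))
    return {sub: sorted(pos) for sub, pos in repeats.items()}


if __name__ == "__main__":
    print(suffix_array("abcabx"))
    print(repeated_substrings("abcabx"))
```

Because all suffixes that start with the same substring sit next to each other in the sorted order, every occurrence of a repeated pattern is reachable from one contiguous block. Any substring that never shows up in the result occurs only once.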

score 1 · Accepted Answer · answered Sep 07 '12 at 14:02

A suffix tree basically just stores all the suffixes of a string in a way that makes it easy to search for substrings. Each substring that occurs more than once corresponds to exactly one internal (non-leaf) node, or to a point on the edge leading into one. It is relatively easy to find the context in which the pattern appears: if you count the number of symbols in each branch below the pattern, that gives you the offset of the substring's end from the end of the sequence. For example, there are two branches from "ab", one of length 1 and one of length 4, so the pattern ends 1 and 4 symbols before the end of the string. Adding the pattern length of 2, it starts 3 and 6 symbols from the end of the string, or at offsets 3 and 0 from the beginning.
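The branch-length arithmetic can be sketched in a few lines of Python. This is only an illustration: it uses a naive, uncompressed suffix trie (nested dicts, quadratic size) in place of a real suffix tree, and it assumes the end marker "$" does not occur in the text. The helper names `build_suffix_trie`, `branch_lengths` and `occurrences` are invented for this example.

```python
def build_suffix_trie(text, end="$"):
    """Naive suffix trie as nested dicts; `end` marks where a suffix stops."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:] + end:
            node = node.setdefault(ch, {})
    return root


def branch_lengths(node, end="$"):
    """Yield the number of symbols on each path from `node` down to a leaf,
    not counting the end marker."""
    stack = [(node, 0)]
    while stack:
        current, depth = stack.pop()
        for ch, child in current.items():
            if ch == end:
                yield depth
            else:
                stack.append((child, depth + 1))


def occurrences(text, pattern):
    """Start offsets of `pattern` in `text`, derived from branch lengths."""
    node = build_suffix_trie(text)
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    n = len(text)
    # A branch of length `tail` means the pattern ends `tail` symbols before
    # the end of the text, so it starts at n - (tail + len(pattern)).
    return sorted(n - (tail + len(pattern)) for tail in branch_lengths(node))


if __name__ == "__main__":
    print(occurrences("abcabx", "ab"))
```

For "abcabx" and the pattern "ab", the branches have lengths 1 and 4, which gives start offsets 6 - (1 + 2) = 3 and 6 - (4 + 2) = 0.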
