2

I am trying to use ukkonen's suffix tree to compare documents.

At this point I'm concerning about two things:

  1. First I'm trying to generate the suffix tree for one document and then use that suffix tree to find all common substrings within that document.

  2. Next is identifying all the common substrings between two documents.

I was able to generate ukkonen suffix tree for a document based on http://marknelson.us/1996/08/01/suffix-trees/ . And search for a given substring. But still I could not find a way to identify all the common substrings within the given document. Could you please tell me a way to do this.I'm using visual c++.

Can we use ukkonen's algorithm to compare two documetns and identify all the common substrings between them? If so please give step by step explanation.

There is a good explanation on Ukkonen's suffix tree in Ukkonen's suffix tree algorithm in plain English?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
user1815763
  • 103
  • 2
  • 10

1 Answers1

1

If you have two documents, you can construct a generalized suffix tree for the two documents using Ukkonen's algorithm, followed by a postprocessing step to clean up impossible configurations.

Once you have a generalized suffix tree, you can run a DFS over the tree to identify every internal node in the tree that has a leaf corresponding to a suffix of each of the documents. Each one of these nodes corresponds to a substring that's common to both documents, since each node corresponds to a prefix of a suffix of both strings.

From there, you can do whatever is most reasonable with those substrings. If you want the longest common substring, just search for the node with the highest string depth that you marked in the first step. If you want to find all such substrings, you can iterate over all those nodes and print out every substring you accumulate along the way.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065