1

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.

EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".

<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>

Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:

      <nounPhrase>
        <article/>
        <adjective/>
        <adjective/>
        <noun/>
      </nounPhrase>

The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.

Obviously the shorter the tree the more frequent, so nounPhrase will be very common.

EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).

peter.murray.rust
  • 37,407
  • 44
  • 153
  • 217
  • 1
    Peter, can you give an indication of order of magnitude for the various dimensions of the problem: approximate number of trees in the collection; number of nodes in an average (and in a big/maximum) tree, expectation of the size of the longuest, relatively frequent subtree sequence, etc. The reason why this matters is that some solutions/algorithms may have a big overhead for setting things up, but should maybe be considered if the number of trees and/or the size of trees is significant. – mjv Nov 06 '09 at 04:05
  • This can definitely not be mapped to a Longest Common Substring problem. Some given nounPhrase can be an instance of some other type of nounPrase tree, even if it contains additional nodes (say, adjectives) not present in the original one. This would be consistent with the problem definition. See my answer below for a general solution to this problem. – sds May 18 '10 at 04:05

4 Answers4

4

Finding the most frequent subtrees in the collection, create a compact form of the subtree, then iterate every subtree and use a hashset to count their occurrences. 30 nodes is too big for a perfect hash - it's only about one bit per node, and you need that much to indicate whether it's a sibling or a child.

That problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is that which occurs the most.

It should be at worst case O(N L^2) for N trees of length L (assuming testing equality of a subtree containing L nodes is O(L)).

Pete Kirkham
  • 48,893
  • 5
  • 92
  • 171
  • I think this is a neat solution, too; hash every node, then create a histogram of the most common hash values. – Steve Cooper Nov 06 '09 at 13:14
  • This look promisingm thanks. And Steve, a histogram is almost certainly what I need. And it allows a heuristic for re-searching new trees - check if they include the node is very quick. I'm expecting the actually frequencies to be favourable to the heuristics. – peter.murray.rust Nov 06 '09 at 17:51
3

I think, although you say that performance isn't yet an issue, this is an NP-hard problem, so it may never be possible to make it fast. If I've understood correctly, you can consider this a variant of the Longest common subsequence problem; if you flatten your tree into a straight sequence like

(nounphrase)(DOWN)(article:the)(adjective:big)(adjective:black)(noun:cat)(UP)

Then your problem becomes LCS.

Wikibooks has a java implementation of LCS here

Steve Cooper
  • 20,542
  • 15
  • 71
  • 88
  • It is indeed very similar and I had considered flattening the tree in this way. There would be some extra overhead due to the brackets, but probably not too much. The node labels could be mapped to single characters. I'll certainly try this. I had also worried that it was NP and it may be that I have to try heuristics - i.e. whenever I find a subtree I try to assess its likelihood of re-occurring. – peter.murray.rust Nov 06 '09 at 12:35
  • Is it not longest common substring? I don't want trees with missing nodes or subtrees – peter.murray.rust Nov 06 '09 at 12:44
  • Hi, Peter. I was thinking of it as a flattened list of nodes, rather than actually rendering it as a string. So long as your implementation has some kind of equality comparer that works on token type and content, you shouldn't need to turn it into a string representation. Something like bool TokenEqual(Token t1, Token2) { return t1.Type == t2.Type && t1.Content == t2.Content } Apologies if you don't know C#. Here it is in Java. bool TokenEqual(Token t1, Token2) { return t1.Type == t2.Type && t1.Content == t2.Content } ;) – Steve Cooper Nov 06 '09 at 13:09
3

This is a well-known problem in computer science, for which there are efficient solutions.

Here are some relevant references:

Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Optimized Substructure Discovery for Semi-structured Data, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer-Verlag, 1-14, August 2002.

Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.

Or, if you just want fast code, go here: FREQT (transforming xml to S-expressions shouldn't give you too much problems, and is left as an exercise for the reader)

sds
  • 373
  • 1
  • 4
  • 7
1

I found tool called gspan very useful in this case. Its available for free download at http://www.cs.ucsb.edu/~xyan/software/gSpan.htm . Its c++ version with matlab interface is at http://www.nowozin.net/sebastian/gboost/