90

This is more of a CS question, but an interesting one :

Let's say we have 2 tree structures with more or less the same nodes reorganized. How would you find

  1. any
  2. in some sense minimal

sequence of operations

  • MOVE(A, B) - moves node A under node B (with the whole subtree)
  • INSERT(N, B) - inserts a new node N under node B
  • DELETE (A) - deletes the node A (with the whole subtree)

that transforms one tree to the other.

There might obviously be cases where such transformation is not possible, trivial being root A with child B to root B with child A etc.). In such cases, the algorithm would simply deliver an result "not possible".

Even more spectacular version is a generalization for networks, i.e. when we assume that a node can occur multiple times in the tree (effectively having multiple "parents"), while cycles are forbidden.

Disclaimer : This is not a homework, actually it comes from a real business problem and I found it quite interesting wondering if somebody might know a solution.

Chris Stryczynski
  • 30,145
  • 48
  • 175
  • 286
Tomas Vana
  • 18,317
  • 9
  • 53
  • 64
  • `MOVE(A,B)` seem to be the same as `INSERT(A,B)` if `A` does not have any children. What happens to the children of `A` if one does `INSERT(A,B)` ? (will they be attached to `A`'s parent ?) – Andre Holzner May 05 '11 at 10:21
  • the difference is that INSERT means really a new node, previously not in the tree (therefore not having any children, at least not in the original state where it wasn't even present). MOVE on the other hand is really a move, i.e. move of the node including its children – Tomas Vana May 05 '11 at 11:29
  • 11
    This sounds like you need to detect [graph-isomorphism](http://en.wikipedia.org/wiki/Graph_isomorphism). The part about the transformation reminds me of the [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance), which can neatly be solved in O(n*m) using dynamic programming. Maybe these pointers help you. – Björn Pollex May 05 '11 at 11:37
  • Did you ever come up with a solution? Looking at the wikipedia article and linked references I don't see an algorithm anywhere. I would like to do this in javascript where I already know the original operations that made the two trees differ, but would like to produce an optional diff: for instance, if part of the tree was pruned and then re-grafted to the same spot it would optimize to no change. – Michael Jan 29 '13 at 19:46
  • @Michael, have you found something usefull? I watching for the same alhoritm of changes reduction in tree. – Pavel Mar 24 '16 at 05:15
  • Sounds close to what an OS/3rd party tool has to do to compare folders of files. – NoChance Jun 28 '18 at 05:54

5 Answers5

27

There is not only a Wikipedia article on graph isomorphism (as Space_C0wb0y points out) but also a dedicated article on the graph isomorphism problem. It has a section Solved special cases for which polynomial-time solutions are known. Trees is one of them and it cites the following two references:

Andre Holzner
  • 18,333
  • 6
  • 54
  • 63
17

You weren't clear if you were comparing abstract syntax trees for source code, XML documents interpreted as trees, or some other type of tree.

There's a number of papers that discuss comparing syntax trees and computing minimum distances by various means. The ideas should be relevant.

A good paper is Change Distilling, which tries to compare the source code for two abstract syntax trees and report a minimal difference. The paper talks about a specific method, and also briefly mentions (and provides references) to a variety of similar techniques.

Few of these algorithms are actually realized in available tools for comparing computer program source text. Our Smart Differencer is one of them; it is driven by an explicit language grammar for many languages.

Wilfred Hughes
  • 29,846
  • 15
  • 139
  • 192
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • 2
    Actually, in our case it's not source code, these are really trees. There is some semantic in those trees, but all in all not that important - they are directly manipulated by the users **as a tree** – Tomas Vana May 11 '11 at 06:36
  • Broken link: I just spent 20 minutes looking for the "Change Distilling" paper. Here's the updated link: http://www.merlin.uzh.ch/publication/show/2531 The software project itself has moved to https://bitbucket.org/sealuzh/tools-changedistiller/wiki/Home (which is how I got the correct link to the PDF) – Shalom Craimer Dec 16 '16 at 12:18
14

Although this question is old, i'll add a couple more references and algorithms below:

  1. X-Diff: An Effective Change Detection Algorithm for XML Documents , Yuan Wang, David J. DeWitt, Jin-Yi Cai
  2. KF-Diff+: Highly Efficient Change Detection Algorithm for XML Documents
  3. diffX: An Algorithm to Detect Changes in Multi-Version XML Documents
  4. Change Detection in XML Trees: a Survey, Luuk Peters
  5. Similarity in Tree Data Structures

Furthermore, there are libraries and frameworks on GitHub (in javascript) which implement diffing of Tree-like structures for example applications dealing with JSON data or XML Trees (e.g for client-side MVC/MVVM):

  1. React.js
  2. JSON-Patch
  3. jsondiffpatch
  4. objectDiff
Timmmm
  • 88,195
  • 71
  • 364
  • 509
Nikos M.
  • 8,033
  • 4
  • 36
  • 43
  • Highly recommend reading the `Change Detection in XML Trees: a Survey` paper - it lists dozens of algorithms for XML diffing (which is just tree diffing). – Timmmm Jan 08 '20 at 12:24
8

In case folks find this question and need something implemented for Node.js or the browser, I'm providing a link and code example for an implementation I've written that you can find on github here: (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram).

This is a tree edit distance approximation algorithm, but it is much, much faster than trying to find the true edit distance. The approximation performs in O(n log n) time and O(n) space whereas true edit distance is often O(n^3) or O(n^2) using known algorithms for true edit distance. See the academic paper from which the PQ-Gram algorithm comes: (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf)

So using jqgram:

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3 },
function(result) {
    console.log(result.distance);
});

And that gives you a number between 0 and 1. The closer to zero, the more closely related the two trees look to jqgram. One approach could be to use jqgram to narrow in on several closely related trees from among many trees given its speed, then utilize true edit distance on the few trees remaining that you need to take a closer inspection of, and for that you can find python implementations for reference or port of the Zhang & Shasha algorithm for example.

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, its Javascript, so use it client or server-side!

ALSO, to answer with respect to cycle detection, check out the clone method inside of jqgram, there is cycle detection there, but credit for that goes to the author of node-clone from which that piece was slightly modified and included.

Matt Mullens
  • 2,266
  • 15
  • 14
  • does this allow multiple lfn ? I want to match more than the label, ie. also the stored value. node.value. – john k Jun 27 '18 at 19:58
3

This is called the tree to tree correction problem or the tree to tree editing problem. Most of the literature dealing with this explicitly relates to comparing XML trees for some reason, so searching for "XML diffing algorithm" yields a lot of results. In addition to Nikos's list of links, I found these:

I also strongly recommend reading Change Detection in XML Trees: a Survey but it is from 2005 so barely any of the tools it mentions exist anymore. Comparing XML Documents as Reference-aware Labeled Ordered Trees has the best intuitive description of some of the algorithms that I have found so far (start at section 2.1.2).

Unfortunately there doesn't seem to be much open source code available that does this and isn't ancient. Just a lot of overly-complex papers. :-/

Timmmm
  • 88,195
  • 71
  • 364
  • 509
  • I'm not be able to see this paper though, is the pdf link broken? `Change Detection in XML Trees: a Survey` – Mengo May 28 '20 at 19:01
  • Works for me. Did you click the `Download full-test PDF` button? Maybe try Sci-hub if it's blocked for some reason. – Timmmm May 29 '20 at 10:12