29

I need to calculate the edit distance between trees. This paper describes an algorithm, but I can't make heads or tails out of it. Could you describe an applicable algorithm in a more approachable way? Pseudocode or code would both be helpful.

NoDataDumpNoContribution
  • 10,591
  • 9
  • 64
  • 104
108
  • 1,631
  • 3
  • 19
  • 25

7 Answers7

28

This Python library implements the Zhang-Shasha algorithm correctly: Zhang-Shasha: Tree edit distance in Python

It began as a direct port of the Java source listed in the currently accepted answer (the one with the tarball link), but that implementation is not correct and is nearly impossible to run at all.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Steve Landey
  • 2,609
  • 25
  • 25
  • Thanks for contributing that back -- glad you were able to implement the Zhang-Shasha algorithm correctly. Sorry the code I linked to wasn't working. – Naaff Nov 24 '10 at 18:27
  • 4
    Steve's fork is no longer the canonical fork of the algorithm see: https://github.com/timtadh/zhang-shasha – tim.tadh Dec 08 '10 at 09:42
  • What are the applications of such a calculation? – X10D Jan 15 '22 at 17:28
  • I've been using this repo for some time and it works fine. However, then it comes to comparing trees that have more than 1000 nodes, the Zhang-shasha algorithm becomes really slow. I was wondering is there any faster TED computation available in Python? – Sta_Doc Mar 11 '22 at 07:15
  • I landed here looking for an implementation that helped me wrap my head around the core of the idea/algorithm. The previously mentioned algorithms helped a bit, but they stuck to the paper's original notation. While great for papers, I find it gets in the way of code understanding. FWIW, here's my attempt: https://github.com/kitizz/tree_edit_distance – kitizz Feb 05 '23 at 04:10
8

I wrote an implementation (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram) for those of you who wish to use tree edit distance approximation using PQ-Gram algorithm in the browser and/or in Node.js.

The jqgram tree edit distance approximation module implements the PQ-Gram algorithm for both server-side and browser-side applications; O(n log n) time and O(n) space performant where n is the number of nodes. See the academic paper from which the algorithm comes: http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf

The PQ-Gram approximation is much faster than obtaining the true edit distance via Zhang & Shasha, Klein, or Guha et. al, whom provide true edit distance algorithms that all perform minimum O(n^2) time and are therefore often unsuitable.

Often in real-world applications it is not necessary to know the true edit distance if a relative approximation of multiple trees to a known standard can be obtained. JavaScript, in the browser and now on the server with the advent of Node.js deal frequently with tree structures and end-user performance is usually critical in algorithm implementation and design; thus jqgram.

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3, depth:10 },
function(result) {
    console.log(result.distance);
});

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, it’s JavaScript, so use it client or server-side!

Now one approach you can use is to use jqgram or PyGram to get a few trees that are close and then go on to use a true edit distance algorithm against a smaller set of trees. Why spend all the computation on trees you can already easily determine are very distant, or vice versa? So you can use jqgram to narrow down choices too.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Matt Mullens
  • 2,266
  • 15
  • 14
  • The www.vldb2005.org link is broken: *"Oops! That page can’t be found. It looks like nothing was found at this location."* – Peter Mortensen Aug 17 '21 at 10:02
  • So, just to make sure I'm understanding - PQ-Gram doesn't actually give an edit distance, right? The idea is to use PQ-Gram to find the "most similar" tree (according to PQ-Gram) in a group of candidate trees and then (if desired) use Zhang-Shasha or something similar to calculate the exact edit distance. Correct? – rinogo Feb 06 '22 at 00:25
  • PS Thanks for the nice library, Matt! – rinogo Feb 06 '22 at 00:25
7

Here's some Java source code (gzipped tarball at the bottom) for a tree edit distance algorithm that might be useful to you.

The page includes references and some slides that go through the "Zhang and Shasha" algorithm step-by-step and other useful links to get you up to speed.

The code in the link has bugs. Steve Johnson and tim.tadh have provided working Python code. See Steve Johnson's comment for more details.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Naaff
  • 9,213
  • 3
  • 38
  • 43
  • 3
    The implementation linked here is incorrect. (See my answer.) I started my implementation by porting it, but when I finally found the paper it was referencing, I found a few departures from the original paper which caused it to fail basic tests of symmetry, triangle inequality, etc. – Steve Landey Nov 24 '10 at 17:32
4

Here you find Java implementations of tree edit distance algorithms:

Tree Edit Distance

In addition to Zhang&Shasha's algorithm of 1989, there are also tree edit distance implementations of more recent algorithms, including Klein 1998, Demaine et al. 2009, and the Robust Tree Edit Distance (RTED) algorithm by Pawlik&Augsten, 2011.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
1

I made a simple Python wrapper (apted.py) for the APTED algorithm using jpype:

# To use, create a folder named lib next to apted.py, then put APTED.jar into it

import os, os.path, jpype

global distancePackage
distancePackage = None

global utilPackage
utilPackage = None

def StartJVM():
  # from http://www.gossamer-threads.com/lists/python/python/379020
  root = os.path.abspath(os.path.dirname(__file__))
  jpype.startJVM(jpype.getDefaultJVMPath(),
  "-Djava.ext.dirs=%s%slib" % (root, os.sep))
  global distancePackage
  distancePackage = jpype.JPackage("distance")
  global utilPackage
  utilPackage = jpype.JPackage("util")


def StopJVM():
  jpype.shutdownJVM()


class APTED:
  def __init__(self, delCost, insCost, matchCost):
    global distancePackage
    if distancePackage is None:
      raise Exception("Need to call apted.StartJVM() first")
    self.myApted = distancePackage.APTED(float(delCost), float(insCost), float(matchCost))

  def nonNormalizedTreeDist(self, lblTreeA, lblTreeB):
    return self.myApted.nonNormalizedTreeDist(lblTreeA.myLblTree, lblTreeB.myLblTree)


class LblTree:
  def __init__(self, treeString):
    global utilPackage
    if utilPackage is None:
      raise Exception("Need to call apted.StartJVM() first")

    self.myLblTree = utilPackage.LblTree.fromString(treeString)

'''
# Example usage:

import apted
apted.StartJVM()
aptedDist = apted.APTED(delCost=1, insCost=1, matchCost=1)
treeA = apted.LblTree('{a}')
treeB = apted.LblTree('{b{c}}')
dist = aptedDist.nonNormalizedTreeDist(treeA, treeB)
print dist


# When you are done using apted
apted.StopJVM()
# For some reason it doesn't usually let me start it again
# and crashes the Python interpreter upon exit when I do
# this, so call only as needed.
'''
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Phylliida
  • 4,217
  • 3
  • 22
  • 34
0

There are many variations of tree edit distance. If you can go with top-down tree edit distance, which limits insertions and deletes to the leaves, I suggest trying the following paper: Comparing Hierarchical Data in External Memory.

The implementation is a straightforward dynamic programming matrix with O(n2) cost.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Davi
  • 507
  • 5
  • 15
-1

There is a journal version of the ICALP2007 paper you refer to, An Optimal Decomposition Algorithm for Tree Edit Distance.

This version also has pseudocode.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131