18

In a C# application I need to check the output of my algorithm, which is an XML tree against another XML tree to see how they are similar. (node order is important, but the structure (nested nodes), names of nodes are more important). Maybe the number of adds, removes and moves that occurs in some "Tree Edit distance" algorithms be a good indicator. But the answers are more Java or Python packages.

So, I tried to use XMLDiffPatch, it works good when the algorithm type is set to Precise. However its bad point is that it just generate a DiffGram file that needs to be analyzed to find the number of operations. Moreover, it is very buggy and generates OutOfRangeException for some XML trees. I also couldn't find better packages for my purpose for .Net. There are some Xml difference packages but maybe none or few of them are on Tree Edit Distance.

An example:

<A>
  <B>
    <C></C>
    <D></D>
    <E>
       <F>
       </F>
    </E>
  </B>
</A>

To:

<A>      
    <C></C>
    <D></D>
    <G></G>
</A>

To convert the first Xml to the second, you need to remove E and F (2 costs) then you need to remove B (but not its sub tree) and to add G. Then the total costs is 4.

So, as I know here I shouldn't ask for packages and tools, I ask for a simple algorithm or (tree edit distance algorithm in .Net) to do that. This is my own algorithm to check similarity and ignore minor difference (Having one or few nested nodes) but it is very primary and just for an starting point:

public int XMLCompare(XmlNode primary, XmlNode secondary)
{
    int x = 0;
    if (secondary == null || primary == null)
        return 1;

    if (secondary.ChildNodes.Count == 1 && primary.ChildNodes.Count > 1)
    {
        x += XMLCompare(primary, secondary.ChildNodes[0]);
    }
    else if (secondary.ChildNodes.Count > 1 && primary.ChildNodes.Count == 1)
    {
        x += XMLCompare(primary.ChildNodes[0], secondary);
    }
    else
    {
        if (primary.Name.ToLower() != secondary.Name.ToLower())
            x = 1;
        int m = Math.Max(primary.ChildNodes.Count, secondary.ChildNodes.Count);
        for (int i = 0; i < m  i++)
        {
            x += XMLCompare(
            i < primary.ChildNodes.Count ? primary.ChildNodes[i] : null,  
            i < secondary.ChildNodes.Count ? secondary.ChildNodes[i] : null);

        }
    }

    return x;
}
Community
  • 1
  • 1
Ahmad
  • 8,811
  • 11
  • 76
  • 141
  • Two trees are rated 0 (= maximum similarity) by your algorithm if the names of the root nodes are equal and one root node has no children and the other one an arbitrary amount of them. Is that intentional? – Haukinger Feb 14 '16 at 15:31
  • @Haukinger I modified it a bit, but is has many frauds, anyway it is just an starting point. – Ahmad Feb 14 '16 at 17:04
  • 1
    Could you please provide examples of XML snippets for primary and secondary what in your opinion are still similar ? would be really helpful to work on the solution. – Maxim Fleitling Feb 17 '16 at 19:50
  • @MaximFleitling I gave an example in the question, however there are some articles and sources about it in http://stackoverflow.com/questions/1065247/how-do-i-calculate-tree-edit-distance – Ahmad Feb 18 '16 at 06:32
  • 1
    Why not just convert your tree structure to a single string of node names (one letter) and use the regular algorithm for the edit distance? [link](http://stackoverflow.com/questions/15571419/c-sharp-levenshteindistance-algorithm-for-spellchecker) In your example it would consist on the calculation of the edit distance between the strings : "" and "". And for precision you can remove the angles brackets, etc. – Fran Casadome Feb 22 '16 at 22:50
  • You can try to implement Zhang-Shasha algorithm https://github.com/timtadh/zhang-shasha or try RTED algorithm http://vldb.org/pvldb/vol5/p334_mateuszpawlik_vldb2012.pdf – Valentin Feb 23 '16 at 09:52

1 Answers1

3

Microsoft has an API for that. check this. This may be old dll reference but just for you information, you need to use something like this. C:\Windows\assembly\GAC\XmlDiffPatch\1.0.8.28__b03f5f7f11d50a3a\XmlDiffPatch.dll