-1

I want to introduce a deterministic sorting to my [OWL] (http://www.w3.org/TR/owl-ref/) file so that I can compare a modified file to original and more easily see where it has been changed. This file is produced by a tool (Protege) and the ordering of elements varies semi-randomly.

The problem is that sorting can't be based on simple things like given element's name and attributes. Often the differences appear only in the child nodes few levels below.

Example:

  <owl:Class rdf:about="#SomeFooClass">
    <rdfs:subClassOf><!-- subclass definition 1 -->
      <owl:Restriction>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="negate"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf><!-- subclass definition 2 -->
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:about="#name"/>
        </owl:onProperty>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>

Here subclass definitions 1 and 2 (and further child elements inside those) vary in order, sometimes 1 is the first, sometimes 2.

I implemented a sort based on a few common direct attributes such a s about and ID, and while this fixes many ambiguous orderings, it can't fix this. XSLT:

<xsl:stylesheet version="2.0" 
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space  elements="*"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()">
                <xsl:sort select="@rdf:about" data-type="text"/>
                <xsl:sort select="@rdf:ID" data-type="text"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

I'm thinking that maybe the solution needs to be able to calculate some kind of "hash-code" for each element, which takes into account all contents of it's child elements. This way subclass definition 1 could have hash-code 3487631 and subclass definition 2 would have 45612, and sorting between them would be deterministic (in case their child elements are unmodified).

EDIT: Just realized that the hashcode calculation should not care about the child note ordering to achieve what it is trying to do.

I could primarily use direct known attribute values and then hash-code, if those are equal. I probably would end up with something like:

<xsl:stylesheet version="2.0" 
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space  elements="*"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()">
                <xsl:sort select="@rdf:about" data-type="text"/>
                <xsl:sort select="@rdf:ID" data-type="text"/>
                <xsl:sort select="my:hashCode(.)" />
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>

   <xsl:function name="my:hashCode" as="xs:string">
      ...
   </xsl:function>
</xsl:stylesheet>

but have no clue on how to implement my:hashCode.

EDIT: as requested, a few examples. The tool may, more or less randomly, produce for example the following kinds of results (1-3) when saving the same data:

1.

<owl:Class rdf:about="#SomeFooClass">
    <rdfs:subClassOf><!-- subclass definition 1 -->
      <owl:Restriction>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="negate"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf><!-- subclass definition 2 -->
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:about="#name"/>
        </owl:onProperty>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
</owl:Class>

2.

<owl:Class rdf:about="#SomeFooClass">
    <rdfs:subClassOf><!-- subclass definition 2 -->
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:about="#name"/>
        </owl:onProperty>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf><!-- subclass definition 1 -->
      <owl:Restriction>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="negate"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
</owl:Class>

3.

<owl:Class rdf:about="#SomeFooClass">
    <rdfs:subClassOf><!-- subclass definition 2 -->
      <owl:Restriction>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:about="#name"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf><!-- subclass definition 1 -->
      <owl:Restriction>
        <owl:onProperty>
          <owl:DatatypeProperty rdf:ID="negate"/>
        </owl:onProperty>
        <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
        >1</owl:maxCardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
</owl:Class>

These examples are a simplified version of the structure but should show the principle. I want to implement a XSLT sorting that will produce identical output for all 3 examples. Whether the transformed result looks like version 1, 2, or 3 (or some other ordering) is not that important.

Janne Mattila
  • 598
  • 7
  • 20
  • Already answered: * Here: http://stackoverflow.com/questions/1684963/xslt-obtaining-or-matching-hashes-for-base64-encoded-data and * Here: http://stackoverflow.com/questions/6753343/using-xsl-to-make-a-hash-of-xml-file – Sean B. Durkin Jun 20 '12 at 13:05
  • Thanks for the hash-calculation references. As I edited my OP, I think a simple hash calculation will not work. This is because I have cases such as – Janne Mattila Jun 20 '12 at 13:14
  • I don't get it. Your questions says you want a hash function, but your comments says you don't want a hash function. – Sean B. Durkin Jun 20 '12 at 13:19
  • Thanks for the hash-calculation references. As I edited my OP, I think a simple hash calculation will not work. This is because I have cases such as 1 2 3 3 2 1 here the contents are actually identical, and should be sorted into same result XML, but "c" and "d" elements are randomly ordered. Now, if hash is calculated, "b" with children c,d will result in different has than "b" with children d,c. End result may be different although it should be the same. – Janne Mattila Jun 20 '12 at 13:20
  • hash calculation was just a quick idea for a potential solution. What I want is to have a deterministic sorting of my data. Don't know if it could be improved with some kind of sorting-before-the-hash-calculation? Or if I could make XSLT start the sorting from leaf nodes? This way hash would always be calculated based on elements with deterministic order. – Janne Mattila Jun 20 '12 at 13:23
  • Maybe you could provide in your question a few simple input documents, and the corresponding sorted output documents. This will help explain. – Sean B. Durkin Jun 20 '12 at 13:26

2 Answers2

0

I am assuming that two nodes are equal, for sort purposes if the node name is the same and the text content is the same. As I understand it, the input document needs to be sorted in some deterministic fashion, but it doesn't matter what the sort is, as long as it is deterministic. Therefore I suggest sorting on the combination of node name and immediate text children. Hashing is not required.

Try this style-sheet:

<xsl:stylesheet version="2.0" 
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space  elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*" />
          <xsl:apply-templates select="node()">
            <xsl:sort select="namespace-uri()" data-type="text" />
            <xsl:sort select="local-name()" data-type="text"/>
            <xsl:sort select="string-join(./text(),' ')" data-type="text"/>
          </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
Sean B. Durkin
  • 12,659
  • 1
  • 36
  • 65
  • Thanks, but unfortunately the first assumption does not hold. If you look at the examples in OP, 1 and 2 should be considered the same structure and sorted to identical results even though "subclass definition 1" and "subclass definition 2" have switched places. This solution produces different end results for cases 1 and 2. – Janne Mattila Jun 21 '12 at 11:31
  • Cases 2 and 3 work, though (they are considered identical). I'm thinking the solution would need to work starting from the leaf nodes and proceeding with the sorting towards root elements. This way when nodes n1a and n2a from document a and nodes n1b and n2b from document b are sorted, their contents are already in identical order, even though their child elements may originally be in different order. I'm thinking I may need to implement this in Java instead, since it's starting to sound quite complex...? – Janne Mattila Jun 21 '12 at 11:38
0

I ended up implementing the sorting in Java after all.

Basically I sort the DOM recursively starting from children:

private Element sort(Element e) {
    // first, sort children's contents recursively
    List<Element> elementNodes = removeElementChildNodes(e);
    for (Element child: elementNodes) {
        sort(child);
    }
    // after that, sort the order of these children
    List<Element> sortedElementNodes = sortElements(elementNodes);
    // add them back
    for (Element child: sortedElementNodes) {
        e.appendChild(child);
    }
    return e;
}

and the actual sorting first compares element name and a few important attribute names, and if those all are equal, compares normalized string conversion of the node and it's children (children's contents are guaranteed to be already sorted at this point)

private class ElementSortingComparator implements Comparator<Element> {

    public int compare(Element o1, Element o2) {
        CompareToBuilder c = new CompareToBuilder(); 
        c.append(o1.getTagName(), o2.getTagName());
        c.append(o1.getAttribute(ID_ATTRIBUTE),
                o2.getAttribute(ID_ATTRIBUTE));
        c.append(o1.getAttribute(ABOUT_ATTRIBUTE),
                o2.getAttribute(ABOUT_ATTRIBUTE));
        int result = c.toComparison();
        if (result == 0) {
            String node1 = partialNodeToString(o1);
            String node2 = partialNodeToString(o2);
            result = node1.compareTo(node2);
        }
        return result;
    }
}
Janne Mattila
  • 598
  • 7
  • 20