I want to introduce a deterministic ordering to my [OWL](http://www.w3.org/TR/owl-ref/) file so that I can compare a modified file to the original and more easily see where it has changed. The file is produced by a tool (Protege), and the ordering of elements varies semi-randomly.
The problem is that the sort can't be based on simple things like a given element's name and attributes. Often the differences appear only in child nodes a few levels below.
Example:
<owl:Class rdf:about="#SomeFooClass">
<rdfs:subClassOf><!-- subclass definition 1 -->
<owl:Restriction>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
<owl:onProperty>
<owl:DatatypeProperty rdf:ID="negate"/>
</owl:onProperty>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf><!-- subclass definition 2 -->
<owl:Restriction>
<owl:onProperty>
<owl:DatatypeProperty rdf:about="#name"/>
</owl:onProperty>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
Here subclass definitions 1 and 2 (and further child elements inside those) vary in order: sometimes 1 comes first, sometimes 2.
I implemented a sort based on a few common direct attributes such as about and ID, and while this fixes many ambiguous orderings, it can't fix this one. XSLT:
<xsl:stylesheet version="2.0"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()">
<xsl:sort select="@rdf:about" data-type="text"/>
<xsl:sort select="@rdf:ID" data-type="text"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
I'm thinking that maybe the solution needs to calculate some kind of "hash-code" for each element, one that takes into account the full contents of its child elements. This way subclass definition 1 could have hash-code 3487631 and subclass definition 2 would have 45612, and sorting between them would be deterministic (as long as their child elements are unmodified).
EDIT: I just realized that the hash-code calculation must ignore child node ordering to achieve what it is trying to do.
I could sort primarily on the known direct attribute values and fall back to the hash-code when those are equal. I would probably end up with something like:
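<xsl:stylesheet version="2.0"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:my="urn:example:my"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- The xs and my prefixes used below must be declared; urn:example:my is only a placeholder URI. -->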
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()">
<xsl:sort select="@rdf:about" data-type="text"/>
<xsl:sort select="@rdf:ID" data-type="text"/>
<xsl:sort select="my:hashCode(.)" />
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
<xsl:function name="my:hashCode" as="xs:string">
...
</xsl:function>
</xsl:stylesheet>
but I have no clue how to implement my:hashCode.
EDIT: as requested, a few examples. The tool may, more or less randomly, produce for example the following kinds of results (1-3) when saving the same data:
1.
<owl:Class rdf:about="#SomeFooClass">
<rdfs:subClassOf><!-- subclass definition 1 -->
<owl:Restriction>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
<owl:onProperty>
<owl:DatatypeProperty rdf:ID="negate"/>
</owl:onProperty>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf><!-- subclass definition 2 -->
<owl:Restriction>
<owl:onProperty>
<owl:DatatypeProperty rdf:about="#name"/>
</owl:onProperty>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
2.
<owl:Class rdf:about="#SomeFooClass">
<rdfs:subClassOf><!-- subclass definition 2 -->
<owl:Restriction>
<owl:onProperty>
<owl:DatatypeProperty rdf:about="#name"/>
</owl:onProperty>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf><!-- subclass definition 1 -->
<owl:Restriction>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
<owl:onProperty>
<owl:DatatypeProperty rdf:ID="negate"/>
</owl:onProperty>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
3.
<owl:Class rdf:about="#SomeFooClass">
<rdfs:subClassOf><!-- subclass definition 2 -->
<owl:Restriction>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
<owl:onProperty>
<owl:DatatypeProperty rdf:about="#name"/>
</owl:onProperty>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf><!-- subclass definition 1 -->
<owl:Restriction>
<owl:onProperty>
<owl:DatatypeProperty rdf:ID="negate"/>
</owl:onProperty>
<owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
>1</owl:maxCardinality>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
These examples are a simplified version of the structure but should show the principle. I want to implement an XSLT sort that produces identical output for all 3 examples. Whether the transformed result looks like version 1, 2, or 3 (or some other ordering) is not that important.
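To make the order-insensitive "hash-code" idea more concrete, here is a minimal sketch of one possible direction, assuming an XSLT 2.0 processor. Instead of a numeric hash it builds a canonical string for each node: the element name, its attributes sorted by name, and the canonical strings of its children, themselves sorted. The function name `my:key`, the namespace URI `urn:example:my`, and the `[...]`/`(...)` key layout are placeholders of my own, not anything the tool or the OWL spec defines, and I have not checked this against real Protege output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my"
    exclude-result-prefixes="xs my">
  <!-- urn:example:my is a placeholder namespace for the helper function. -->

  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Codepoint collation, so the ordering does not depend on locale settings. -->
  <xsl:variable name="cp" as="xs:string"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Identity transform; child nodes are emitted in canonical-key order. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="node()">
        <xsl:sort select="my:key(.)" collation="{$cp}"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <!-- Canonical key of a node; independent of the order of its children. -->
  <xsl:function name="my:key" as="xs:string">
    <xsl:param name="n" as="node()"/>
    <xsl:choose>
      <xsl:when test="$n instance of element()">
        <!-- Attributes as name=value pairs, sorted by attribute name. -->
        <xsl:variable name="attrs" as="xs:string*">
          <xsl:for-each select="$n/@*">
            <xsl:sort select="name()" collation="{$cp}"/>
            <xsl:sequence select="concat('@', name(), '=', .)"/>
          </xsl:for-each>
        </xsl:variable>
        <!-- Keys of all child nodes, computed recursively and then sorted. -->
        <xsl:variable name="kids" as="xs:string*">
          <xsl:perform-sort select="for $c in $n/node() return my:key($c)">
            <xsl:sort select="." collation="{$cp}"/>
          </xsl:perform-sort>
        </xsl:variable>
        <xsl:sequence select="concat(name($n), '[', string-join($attrs, ','), '](',
                                     string-join($kids, '|'), ')')"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- Text, comments and processing instructions: their normalized string value. -->
        <xsl:sequence select="concat('#', normalize-space($n))"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
```

Because the key already contains the attributes, separate sorts on `@rdf:about` and `@rdf:ID` are not needed in this sketch, though they could still be placed in front of the key sort. Two caveats: the key uses prefixed names as written in the file, so it assumes the tool keeps its namespace prefixes stable, and it recomputes subtree keys at every level, which may be slow on large ontologies.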