I need to validate actual XML data against expected XML data.

Example: Expected Data

<root>
<Orders>
    <Order>
        <DateTime>\d{8}_\d{4}</DateTime>
        <OrderID>\d{4}</OrderID>
    </Order>
    <Order>
        <DateTime>\d{8}_\d{4}</DateTime>
        <OrderID>\d{4}</OrderID>
    </Order>
</Orders>
<queryStatus>Success</queryStatus>
</root>

Since the DateTime and OrderID values change on each execution, I maintain a pattern instead of hard-coding them. The above is just a sample: we will have several different XMLs that we need to compare and validate.

The expected data above should match either of the following two XMLs, which are the actual data:

XML1:

<root>
<Orders>
    <Order>
        <DateTime>08052021_1250</DateTime>
        <OrderID>1234</OrderID>
    </Order>
    <Order>
        <DateTime>08052021_1251</DateTime>
        <OrderID>4567</OrderID>
    </Order>
</Orders>
<queryStatus>Success</queryStatus>
</root>

XML2:

<root>
<queryStatus>Success</queryStatus>
<Orders>
    <Order>
        <DateTime>08052021_1250</DateTime>
        <OrderID>1234</OrderID>
    </Order>
    <Order>
        <DateTime>08052021_1251</DateTime>
        <OrderID>4567</OrderID>
    </Order>
</Orders>
</root>

As far as I understand, XMLUnit will report that DateTime and OrderID do not match. I am open to a Java-based solution or a Bash script (xmllint) based solution. Can you please give me some pointers on how to approach this?

ErikMD
  • Look into [XSD](https://en.wikipedia.org/wiki/XML_Schema_\(W3C\)), [RELAX NG](https://en.wikipedia.org/wiki/RELAX_NG), or similar tools. – Shawn Aug 05 '21 at 20:45
  • `\d{8}_\d{4}` is a regex that validates a pattern but not the actual value. Do you need to validate that the values are equal? As far as XML content goes, XML1 and XML2 look identical. – LMC Aug 05 '21 at 20:53
  • XML1 and XML2 are identical except that the XML elements appear in a different order. Yes, \d{8}_\d{4} is a regex. Since I need to validate outputs that will have a different DateTime / order number on every execution against an 'expected data', I cannot put the actual values in the 'expected data'. Therefore I need to put in a regex (or something similar) and check whether the actual data matches the expected data format. – Nishant Shrivastava Aug 05 '21 at 21:40

4 Answers

As suggested by @Shawn in the comments, you may want to use standard tools to validate XML data w.r.t. a model, specified in a dedicated language such as XML Schema or RELAX NG.

Writing an XSD model

First, you need to write a definition of this model, say an XML Schema Definition (.xsd), taking into account the constraint that you mentioned regarding the order indifference for elements Orders and queryStatus: relying on the <xs:all> construct instead of the <xs:sequence> one for the rootType definition.

model.xsd

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  attributeFormDefault="unqualified"
  elementFormDefault="qualified">
  <xs:element name="root" type="rootType"/>
  <xs:complexType name="rootType">
    <xs:all>
      <xs:element type="queryStatusType" name="queryStatus"/>
      <xs:element type="OrdersType" name="Orders"/>
    </xs:all>
  </xs:complexType>
  <xs:simpleType name="queryStatusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Success"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:complexType name="OrdersType">
    <xs:sequence>
      <xs:element type="OrderType" name="Order" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element type="DateTimeType" name="DateTime"/>
      <xs:element type="OrderIDType" name="OrderID"/>
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="DateTimeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{8}_[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="OrderIDType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

The XML Schema definition above was written following the so-called Venetian Blind design, but this is not the only possible choice.

Obviously, this proof-of-concept should be adapted to your needs, e.g., if the minOccurs="1" maxOccurs="unbounded" spec above is irrelevant, etc.; but as the XML Schema language is a bit involved and very expressive, I'd suggest reading some introductory course on this topic before looking at references such as its W3C specification (Structures and Datatypes).

Validating your XML documents

You need:

  • An XML Validator
  • The .xsd file (similar to the example above)
  • The .xml documents to validate

This is actually language-agnostic because as you mention, you can either:

  1. Use the standard libraries for XML from your usual programming language;
  2. Use a shell script using a command-line tool such as xmllint.

Regarding the latter choice, as mentioned in this other SO answer, you can just run:

xmllint --noout --schema model.xsd file1.xml
# → file1.xml validates
echo $?
# → 0
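
For the first choice (standard libraries), here is a minimal Java sketch using the JDK's javax.xml.validation API. The class name XsdCheck, the method isValid, and the inlined, trimmed-down schema are illustrative assumptions, not part of the answer above; in practice you would load model.xsd and your documents from files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class XsdCheck {

    // Trimmed-down variant of model.xsd (hypothetical, inlined to keep the sketch self-contained).
    public static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:simpleType name='DateTimeType'><xs:restriction base='xs:string'>"
        + "<xs:pattern value='[0-9]{8}_[0-9]{4}'/></xs:restriction></xs:simpleType>"
        + "<xs:simpleType name='OrderIDType'><xs:restriction base='xs:string'>"
        + "<xs:pattern value='[0-9]{4}'/></xs:restriction></xs:simpleType>"
        + "<xs:element name='root'><xs:complexType><xs:all>"
        + "<xs:element name='queryStatus' type='xs:string'/>"
        + "<xs:element name='Orders'><xs:complexType><xs:sequence>"
        + "<xs:element name='Order' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
        + "<xs:element name='DateTime' type='DateTimeType'/>"
        + "<xs:element name='OrderID' type='OrderIDType'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:all></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML text validates against the XSD text, false otherwise. */
    public static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            // A broken schema is a setup error, so it is reported as an exception.
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Invalid schema", e);
        }
        try {
            // Without a custom ErrorHandler, the first validation error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.err.println("Validation failed: " + e.getMessage());
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String order = "<Order><DateTime>08052021_1250</DateTime><OrderID>1234</OrderID></Order>";
        // Both orderings of the children of <root> are accepted thanks to xs:all.
        System.out.println(isValid(XSD,
            "<root><Orders>" + order + "</Orders><queryStatus>Success</queryStatus></root>"));
        System.out.println(isValid(XSD,
            "<root><queryStatus>Success</queryStatus><Orders>" + order + "</Orders></root>"));
    }
}
```

Returning a boolean keeps the sketch short; a real test harness would probably register an ErrorHandler to collect every validation error rather than stop at the first one.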
ErikMD
  • Hello Erik, thanks for the answer. However, a combination of xmllint and sorting with xsltproc, with the actual validation done using grep / diff, seems to provide a working solution with less effort than creating XSD schemas for a large number of XML files. I do agree, though, that your solution is more elegant if we can create the XSD schema files. – Nishant Shrivastava Aug 07 '21 at 02:35

A RELAX-NG compact schema to validate your documents:

element root {
   element Orders {
      element Order {
         element DateTime {
            xsd:string { pattern = "\d{8}_\d{4}" }
         },
         element OrderID {
            xsd:string { pattern = "\d{4}" }
         }
      }+
   } &
   element queryStatus {
      xsd:string {
         pattern = "Success"
      }
   }
}

(This one will match one or more Order elements in an Orders, not just 2, if that matters).

Example, using the jing validator (written in Java, and it looks like it can be used from your own Java code):

$ jing -c expected.rnc test[12].xml && echo "Files pass"
Files pass

If you can't rewrite your expected data XML for some reason, an XSLT stylesheet to convert it to standard RELAX-NG XML:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" 
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  
  <xsl:template match="/root">
    <element name="root" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
      <interleave>
        <xsl:apply-templates select="*"/>
      </interleave>
    </element>
  </xsl:template>

  <xsl:template match="Orders">
    <element xmlns="http://relaxng.org/ns/structure/1.0" name="Orders">
      <xsl:for-each select="Order">
        <element name="Order">
          <element name="DateTime">
            <data type="string">
              <param name="pattern"><xsl:value-of select="DateTime"/></param>
            </data>
          </element>
          <element name="OrderID">
            <data type="string">
              <param name="pattern"><xsl:value-of select="OrderID"/></param>
            </data>
          </element>
        </element>
      </xsl:for-each>
    </element>
  </xsl:template>
  
  <xsl:template match="queryStatus">
    <element xmlns="http://relaxng.org/ns/structure/1.0" name="queryStatus">
      <data type="string">
        <param name="pattern"><xsl:value-of select="."/></param>
      </data>
    </element>
  </xsl:template>
  
</xsl:stylesheet>

(This one will match the exact number of Order elements in the expected file.)

Examples:

$ xsltproc convert.xslt expected.xml > expected.rng
$ jing expected.rng test[12].xml && echo "Files pass"
Files pass
$ xmllint --noout --relaxng expected.rng test[12].xml
test1.xml validates
test2.xml validates
Shawn

Try an XSLT 3.0 transformation.

For each element in XML2, find the corresponding element in XML1 and test if the value in XML2 matches the regular expression in XML1: that is

<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="*[text()]">
  <xsl:if test="not(matches(., f:corresponding-pattern(.)))">
    <xsl:message>Mismatch!</xsl:message>
  </xsl:if>
</xsl:template>

That leaves the question of how to implement f:corresponding-pattern(). Probably a good way is to index all elements in XML1 by path:

<xsl:key name="by-path" match="*[text()]" use="path(.)"/>

and then f:corresponding-pattern(.) reduces to key('by-path', path(.), doc('xml1.xml'))
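
The same idea can be sketched in plain Java for readers without an XSLT 3.0 processor. This swaps in a DOM walk for the XSLT key lookup and is not the stylesheet above: the class PatternCompare and its methods are hypothetical names, the path format only imitates path(), and it uses a full-string regex match where XPath matches() accepts a substring match.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PatternCompare {

    /** Maps a positional path such as /root[1]/Orders[1]/Order[2]/OrderID[1] to the leaf's text. */
    static Map<String, String> leafTexts(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            Map<String, String> out = new LinkedHashMap<>();
            collect(root, "/" + root.getTagName() + "[1]", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void collect(Element el, String path, Map<String, String> out) {
        // The index counts only same-named siblings, so the order of differently
        // named siblings (Orders vs. queryStatus) does not affect the path.
        Map<String, Integer> seen = new HashMap<>();
        boolean hasChildElement = false;
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElement = true;
                String name = ((Element) n).getTagName();
                int index = seen.merge(name, 1, Integer::sum);
                collect((Element) n, path + "/" + name + "[" + index + "]", out);
            }
        }
        if (!hasChildElement) {
            out.put(path, el.getTextContent().trim());
        }
    }

    /** True if both documents have the same leaf paths and every actual value matches its pattern. */
    public static boolean matches(String expectedXml, String actualXml) {
        Map<String, String> expected = leafTexts(expectedXml);
        Map<String, String> actual = leafTexts(actualXml);
        if (!expected.keySet().equals(actual.keySet())) {
            return false; // a missing or extra element
        }
        for (Map.Entry<String, String> e : expected.entrySet()) {
            // Assumption: every leaf text in the expected file is a Java regex.
            if (!Pattern.matches(e.getValue(), actual.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

Because every expected leaf is treated as a regex, literal values containing metacharacters (for example a dot or parentheses) would need escaping in the expected file.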

Michael Kay

Many thanks for the RELAX NG / XSD solutions, but they sound a bit complicated, and converting at least a hundred different XML outputs to RELAX NG or XSD models sounds daunting to me (maybe because I am not so familiar with these approaches). However, I found a simpler workaround on the internet: https://rhubbarb.wordpress.com/2009/04/20/comparing-xml-files/ This link suggests sorting the XML elements and attributes via XSLT, using the xsltproc command in Bash:

xsltproc sort.xsl "$var_ExpectedFile" > tmpExpectedFile.xml
xsltproc sort.xsl "$var_actualFile" > tmpActualFile.xml

Content of sort.xsl

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:copy-of select="namespace::*"/>
      <xsl:for-each select="@*">
        <xsl:sort select="name()"/>
        <xsl:attribute name="{name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:for-each>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
        <xsl:sort select="@type"/>
        <xsl:sort select="@name"/>
      </xsl:apply-templates>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

If I strip all newlines from the XMLs and sort them using the XSLT above, I am able to compare the XMLs using a simple

grep -f "$var_ExpectedFile" "$var_actualFile"

This has worked so far with the XMLs I have tried. It sounds a lot easier, but I would like a review of this approach. Does it sound foolproof?

  • IMHO, sorting the XML elements and doing `grep -f` afterwards seems fragile and not expressive enough… even if it looks simple and can work for some of the XML documents you need to validate. If you want to validate an XML document where a given element/attribute is optional, or can only have a given number of sub-elements (say, between 2 and 4), I guess the only way to handle this validation precisely and tractably is to use an XML schema language, such as XSD (e.g., for the 2nd example of my comment, you can write ``) – ErikMD Aug 07 '21 at 15:45
  • Note anyway that the XSD can be automatically generated from the XML files to validate: see e.g. [this nice website](https://www.freeformatter.com/xsd-generator.html); but you could just as well find on the web [a dedicated XSD editor or a plugin for your Java IDE](https://www.google.com/search?q=%22XSD%22+%22editor%22+with+Java+support). Then, either the generated XSD files will be directly fine, or if you want to relax some constraints, modify them with your XSD editor (or manually, after browsing the W3C's XSD spec I mentioned [in my answer](https://stackoverflow.com/a/68674109/9164010)). – ErikMD Aug 07 '21 at 15:58