
How to clean an XML file removing all elements not present in a provided XSD?

This does not work:

public static void Main()
{
    XmlTextReader xsdReader = new XmlTextReader(@"books.xsd");
    XmlSchema schema = XmlSchema.Read(xsdReader, null);

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.Schemas.Add(schema);
    settings.ValidationType = ValidationType.Schema;
    settings.ValidationEventHandler += new ValidationEventHandler(ValidationCallBack);

    XmlReader xmlReader = XmlReader.Create(@"books.xml", settings);
    XmlWriter xmlWriter = XmlWriter.Create(@"books_clean.xml");
    xmlWriter.WriteNode(xmlReader, true);
    xmlWriter.Close();
    xmlReader.Close();
}
private static void ValidationCallBack(object sender, ValidationEventArgs args)
{
    ((XmlReader)sender).Skip();
}

When I use the above, instead of removing all "junk" tags, it removes only the first junk tag and leaves the second one. As for why I need to accept this file: I am using an old SQL Server 2012 instance that requires the XML to match the XSD exactly, even if the extra elements in the XML are not used by the application. I do not have control over the source XML, which is produced by a third-party tool with an unpublished XSD.

Sample Files:
Books.xsd

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="bookstore">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="title"/>
              <xs:element type="xs:float" name="price"/>
            </xs:sequence>
            <xs:attribute type="xs:string" name="genre" use="optional"/>
            <xs:attribute type="xs:string" name="ISBN" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Books.xml

<bookstore>
  <book genre='novel' ISBN='10-861003-324'>
    <title>The Handmaid's Tale</title>
    <price>19.95</price>
    <junk>skdjgklsdg</junk>
    <junk2>skdjgklsdg</junk2>
  </book>
  <book genre='novel' ISBN='1-861001-57-5'>
    <title>Pride And Prejudice</title>
    <price>24.95</price>
    <junk>skdjgssklsdg</junk>
  </book>
</bookstore>

Code mostly copied from: Validating an XML against referenced XSD in C#

DKATyler
  • Source XML file is ~500mb, the last input file had ~120K usable nodes, and ~800K unused nodes. So, a stream based approach is preferred. – DKATyler May 03 '18 at 20:15
  • You are missing following : while (reader.Read()) ; – jdweng May 04 '18 at 06:27
  • @jdweng tried that and modified question. Reader.Read() only raises ValidationEvent on the first invalid element of each node. It did at least remove both tags, just not the tag. – DKATyler May 04 '18 at 14:04
  • The issue is that when you have items like 1,2,3,4,5 and you remove item 3, 4 becomes 3 and 5 becomes 4. Then you end up skipping the 4th item. So the solution is to enumerate backwards through the list, like for(i = list.Count() - 1; i >= 0; i--) – jdweng May 04 '18 at 17:16

2 Answers


If it's simply a question of removing all elements whose names don't appear anywhere in the schema, then it is possibly feasible, as described below. However, in the general case (a) this doesn't ensure the instance will be valid against the schema (the elements might be in the wrong order, for example), and (b) it might remove elements that the schema actually allows (because of wildcards).

If the approach of removing unknown elements looks useful, you could do it as follows:

(a) write an XSLT stylesheet that extracts all the element names from the schema by looking for xs:element[@name] declarations, generating a document with the format:

<allowedElements>
  <allow name="book" namespace=""/>
  <allow name="isbn" namespace=""/>
</allowedElements>

(b) write a second (streamable) XSLT stylesheet:

<xsl:transform version="3.0" xmlns:xsl="....">
  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
  <xsl:key name="k" match="allow" use="@name, @namespace" composite="yes"/>
  <xsl:template match="*[not(key('k', (local-name(), namespace-uri()), doc('allowed-elements.xml')))]"/>
</xsl:transform> 
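
Step (a) does not have to be done in XSLT. As a rough sketch in C#, the names can be collected with LINQ to XML instead. This swaps the XSLT extraction for a different technique. The class and method names (`SchemaNames`, `DeclaredElementNames`) are made up for illustration, and the code assumes a schema with no target namespace, as in the sample Books.xsd, so it collects local names only.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SchemaNames
{
    private static readonly XNamespace Xs = "http://www.w3.org/2001/XMLSchema";

    // Returns the name of every xs:element declaration that has a name attribute.
    // Declarations that only use ref="..." are skipped.
    // Assumption: the schema has no target namespace, so namespaces are ignored.
    public static HashSet<string> DeclaredElementNames(XDocument schema)
    {
        return new HashSet<string>(
            schema.Descendants(Xs + "element")
                  .Select(e => (string)e.Attribute("name"))
                  .Where(name => name != null));
    }
}
```

For the sample schema, `SchemaNames.DeclaredElementNames(XDocument.Load("books.xsd"))` should yield bookstore, book, title and price. The caveats above still apply: this only whitelists names and does not make the result valid against the schema.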
Michael Kay
  • This sounds promising, trying to figure out how to use multiple XSLT docs and run the transform. – DKATyler May 04 '18 at 15:28
  • Haven't been able to get a working part B XSLT document. There's a missing paren somewhere and I haven't been able to guess its location yet. – DKATyler May 08 '18 at 15:31

The code below successfully removes all of the junk tags from the provided examples. For elements, the second xsl:template takes priority over the first: it matches every element except the explicitly whitelisted ones and produces no output for them. The first xsl:template then writes a copy of the remaining nodes to the XmlWriter.

Code:

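// Requires: using System.Xml; using System.Xml.Xsl;
public static void Main()
{
    // Load the stylesheet that whitelists the allowed element names.
    XslCompiledTransform myXslTrans = new XslCompiledTransform();
    myXslTrans.Load("books.xslt");

    // The using blocks close the reader and writer even if the transform throws.
    using (XmlReader xmlReader = XmlReader.Create("books.xml"))
    using (XmlTextWriter myWriter = new XmlTextWriter("books_clean.xml", null))
    {
        myXslTrans.Transform(xmlReader, null, myWriter);
    }
}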

books.xslt

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*[
  not(name()='bookstore') and
  not(name()='book') and
  not(name()='title') and
  not(name()='price')
  ]" />
</xsl:stylesheet>
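
As far as I know, XslCompiledTransform is an XSLT 1.0 processor and does not stream its input, which matters for the ~500mb source file mentioned in the comments. The same whitelist can be applied in a single forward pass with XmlReader and XmlWriter. The following is only a sketch under that assumption: `WhitelistFilter`, `Filter` and `CleanFile` are illustrative names, elements are matched by local name only, and the XML declaration and DOCTYPE of the input are not copied.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class WhitelistFilter
{
    // Copies the input to the output, dropping every element (and its whole
    // subtree) whose local name is not in allowedNames.
    public static void Filter(XmlReader reader, XmlWriter writer, ISet<string> allowedNames)
    {
        reader.Read();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && !allowedNames.Contains(reader.LocalName))
            {
                // Skip() consumes the subtree and leaves the reader on the node
                // after the end tag, so Read() must not be called again here.
                reader.Skip();
                continue;
            }
            WriteShallow(reader, writer);
            reader.Read();
        }
    }

    // Writes only the current node, without descending into its children.
    private static void WriteShallow(XmlReader reader, XmlWriter writer)
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                bool isEmpty = reader.IsEmptyElement;
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, false);
                if (isEmpty) writer.WriteEndElement();
                break;
            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.CDATA:
                writer.WriteCData(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
            // Other node types (XML declaration, DOCTYPE, ...) are ignored in this sketch.
        }
    }

    // Convenience wrapper for files; whitespace is dropped on read and the
    // output is re-indented so removed elements do not leave blank lines.
    public static void CleanFile(string inputPath, string outputPath, ISet<string> allowedNames)
    {
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
        var writerSettings = new XmlWriterSettings { Indent = true };
        using (XmlReader reader = XmlReader.Create(inputPath, readerSettings))
        using (XmlWriter writer = XmlWriter.Create(outputPath, writerSettings))
        {
            Filter(reader, writer, allowedNames);
        }
    }
}
```

For the sample files this would be called as `WhitelistFilter.CleanFile("books.xml", "books_clean.xml", new HashSet<string> { "bookstore", "book", "title", "price" })`. The point to get right is the loop: Skip() already advances past the removed element, so calling Read() afterwards would step over the next sibling, which is one way to end up removing only every other junk element.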
DKATyler