How can I programmatically fix the content of an XML document so that it conforms to the maxLength restrictions of its schema (in this case: truncate the content to 10 characters if it is longer)?

This very similar question asks how to insert default values based on the schema (unfortunately the answer was not detailed enough for me).

The API documentation of ValidatorHandler says:

ValidatorHandler checks if the SAX events follow the set of constraints described in the associated Schema, and additionally it may modify the SAX events (for example by adding default values, etc.)

I looked at usages of Schema.newValidatorHandler() and ValidatorHandler.setContentHandler() on tabnine.com, but I couldn't find any examples that modify the input stream.

Example Schema:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="10" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

Example XML document:

<?xml version="1.0" encoding="UTF-8" ?>
<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema.xsd">0123456789x</a>

Example validation error:

cvc-maxLength-valid: Value '0123456789x' with length = '11' is not facet-valid with respect to maxLength '10' for type '#AnonType_a'.

Current attempts (this code uses the javax.xml APIs, but I am open to any solution at all):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import javax.xml.validation.ValidatorHandler;

public class Test {
  public static void main(String[] args) throws Exception {
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = schemaFactory.newSchema(new File("schema.xsd"));

    // validation
    Validator validator = schema.newValidator();
    validator.validate(new StreamSource(new File("document.xml")));

    // modify stream while parsing?
    ValidatorHandler validatorHandler = schema.newValidatorHandler();
    validatorHandler.setErrorHandler(?);
    validatorHandler.setContentHandler(?);
    validatorHandler.setDocumentLocator(?);

    SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
    saxParser.parse(new File("document.xml"), ?); // only accepts DefaultHandler or HandlerBase
  }
}
Peter Mortensen
Reto Höhener
  • XML schemas can be complicated. Even your example schema could have been written by declaring a separate `` element, followed by ``. To really accomplish your task would require writing most of an XML schema parser. – VGR May 24 '22 at 20:30
  • Couldn't it be done [with](https://stackoverflow.com/questions/34107282/use-xslt-to-transform-xml-to-text-with-maximum-width/34114577#34114577) [XSLT](https://stackoverflow.com/questions/19759169/maximum-length-xslt-but-keep-full-paragraphs-in-output) [alone](https://stackoverflow.com/questions/14219346/why-is-the-maxlength-attribute-in-an-xsd-not-restricting-the-number-of-charact)? (Not a rhetorical question.) – Peter Mortensen May 25 '22 at 22:00
  • I mean XSLT can definitely work, but it feels to me like a workaround that can carry big implications. An XML document that does not conform to its XSD can imply two things. 1) The document is garbage. You can try to "fix" it, but that is pretty much automating chaos, and it becomes your responsibility if the "fixed" version ends up being even more garbage down the pipeline. 2) It might imply that the XSD is actually faulty and needs to be corrected. In this case, 2) seems very likely, to be honest. – Gimby May 27 '22 at 15:10
  • @PeterMortensen Thank you for the links! I haven't used XSLT before and will have a close look at it. – Reto Höhener May 30 '22 at 07:17
  • @Gimby Both your points apply: 1 (questionable input) and 2 (questionable schema). As usual, we have limited options and resources for changing either, and still have to make it work somehow, even if it means throwing away some of the input. I am also hoping that the 'auto-correction' mechanism might produce some diagnostic output that could help us improve the input validation in the long term. – Reto Höhener May 30 '22 at 07:49

1 Answer

I managed to implement a solution based on Schema.newValidatorHandler(). Most of my time was lost on the fact that SAXParser.parse() only accepts a DefaultHandler (or the deprecated HandlerBase). To plug in an arbitrary ContentHandler such as the ValidatorHandler, one has to go through SAXParser.getXMLReader().setContentHandler() and then call XMLReader.parse().

I am aware that this proof of concept is not very robust, because it parses the validation error message to extract the maxLength value from the schema. The solution therefore relies on the exact message wording of one specific validator implementation.
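To make the fragile part easier to see, here is a minimal sketch of the message parsing in isolation. The class name `MaxLengthMessage` and the method `parseMaxLength` are hypothetical names, and the pattern is modelled only on the message quoted in the question; other validators, versions, or locales may word it differently. It tolerates whitespace variations and returns an empty result instead of failing when the format is not recognized.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxLengthMessage {
    // Assumption: the message looks like the one quoted in the question.
    // Only the key prefix and the maxLength value are matched, so changes
    // elsewhere in the wording do not break the extraction.
    private static final Pattern MAX_LENGTH_ERROR = Pattern.compile(
            "cvc-maxLength-valid:.*\\bmaxLength\\s+'(\\d{1,9})'.*", Pattern.DOTALL);

    /** Returns the maxLength value named in the message, or empty if the format is not recognized. */
    static OptionalInt parseMaxLength(String message) {
        if (message == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = MAX_LENGTH_ERROR.matcher(message);
        return matcher.matches()
                ? OptionalInt.of(Integer.parseInt(matcher.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String message = "cvc-maxLength-valid: Value '0123456789x' with length = '11' "
                + "is not facet-valid with respect to maxLength '10' for type '#AnonType_a'.";
        System.out.println(parseMaxLength(message));                     // OptionalInt[10]
        System.out.println(parseMaxLength("cvc-pattern-valid: other"));  // OptionalInt.empty
    }
}
```

Returning an empty `OptionalInt` lets the caller decide what to do with unrecognized messages (for example, rethrow the original `SAXParseException`) rather than silently leaving the value untouched.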

I looked at schema-aware XSLT transformation, but could not find any indication that the schema information (such as facet values) can be accessed in the transformation expressions.

Writing my own specialized schema parser is still not completely off the table.

import java.io.IOException;
import java.io.StringReader;
import java.util.Map.Entry;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TestFixMaxLength {
    public static void main(String[] args) throws Exception {
        SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(new StreamSource(TestFixMaxLength.class.getResourceAsStream("schema.xsd")));

        // validation on original input should fail
        // schema.newValidator().validate(new StreamSource(TestFixMaxLength.class.getResourceAsStream("input.xml")));

        CustomContentHandler customContentHandler = new CustomContentHandler();
        ValidatorHandler validatorHandler = schema.newValidatorHandler();
        validatorHandler.setContentHandler(customContentHandler);
        validatorHandler.setErrorHandler(customContentHandler);

        SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
        saxParserFactory.setNamespaceAware(true);
        SAXParser saxParser = saxParserFactory.newSAXParser();

        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(validatorHandler);
        xmlReader.parse(new InputSource(TestFixMaxLength.class.getResourceAsStream("input.xml")));
        // not: saxParser.parse(TestFixMaxLength.class.getResourceAsStream("input.xml"), ???);

        System.out.println();
        System.out.println();
        System.out.println(customContentHandler.m_outputBuilder.toString());

        // validation on corrected input should pass
        schema.newValidator().validate(new StreamSource(new StringReader(customContentHandler.m_outputBuilder.toString())));
    }

    /****************************************************************************************************************************************/
    private static class CustomContentHandler extends DefaultHandler {
        private StringBuilder m_outputBuilder = new StringBuilder();
        private SortedMap<String, String> m_prefixMappings = new TreeMap<>();
        private int m_lastValueLength = 0;
        private Matcher m_matcher = Pattern.compile(
                "cvc-maxLength-valid: Value '(.+?)' with length = '(.+?)' is not facet-valid with respect to maxLength '(.+?)' for type '(.+?)'.",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher("");

        @Override
        public void error(SAXParseException e) throws SAXException {
            if (e.getMessage().startsWith("cvc-maxLength-valid")) {
                System.out.println("error: " + e);
                m_matcher.reset(e.getMessage());
                if (m_matcher.matches()) {
                    int maxLength = Integer.parseInt(m_matcher.group(3));
                    m_outputBuilder.setLength(m_outputBuilder.length() - m_lastValueLength + maxLength);
                } else {
                    System.out.println("unexpected message format");
                }
            }
        }

        @Override
        public void startDocument() throws SAXException {
            System.out.println("startDocument");
        }

        @Override
        public void endDocument() throws SAXException {
            System.out.println("endDocument");
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) throws SAXException {
            System.out.println("startPrefixMapping: prefix: " + prefix + ", uri: " + uri);
            m_prefixMappings.put(prefix, uri);
        }

        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
            System.out.println("endPrefixMapping: prefix: " + prefix);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes)
                throws SAXException {
            System.out.println("startElement: uri: " + uri + ", localName: " + localName + ", qName: " + qName
                    + ", attributes: " + attributes.getLength());

            m_outputBuilder.append('<');
            m_outputBuilder.append(qName);

            for (int i = 0; i < attributes.getLength(); i++) {
                m_outputBuilder.append(' ');
                m_outputBuilder.append(attributes.getQName(i));
                m_outputBuilder.append('=');
                m_outputBuilder.append('\"');
                m_outputBuilder.append(attributes.getValue(i));
                m_outputBuilder.append('\"');
            }

            if (!m_prefixMappings.isEmpty()) {
                for (Entry<String, String> mapping : m_prefixMappings.entrySet()) {
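                    // The default namespace is reported with an empty prefix and must be written as xmlns="..."
                    m_outputBuilder.append(mapping.getKey().isEmpty() ? " xmlns" : " xmlns:" + mapping.getKey());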
                    m_outputBuilder.append('=');
                    m_outputBuilder.append('\"');
                    m_outputBuilder.append(mapping.getValue());
                    m_outputBuilder.append('\"');
                }

                m_prefixMappings.clear();
            }

            m_outputBuilder.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.println("endElement: uri: " + uri + ", localName: " + localName + ", qName: " + qName);

            m_outputBuilder.append('<');
            m_outputBuilder.append('/');
            m_outputBuilder.append(qName);
            m_outputBuilder.append('>');
        }

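        // Output length right after the most recent characters() call; used to detect continuation chunks.
        private int m_lastValueEnd = -1;

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            System.out.println(
                    "characters: '" + new String(ch, start, length) + "', start: " + start + ", length: " + length);

            // A parser may deliver one text node in several chunks, so add up consecutive calls
            // instead of remembering only the length of the last chunk.
            boolean continuation = m_outputBuilder.length() == m_lastValueEnd;
            m_outputBuilder.append(ch, start, length);
            m_lastValueLength = continuation ? m_lastValueLength + length : length;
            m_lastValueEnd = m_outputBuilder.length();
        }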

        @Override
        public void skippedEntity(String name) throws SAXException {
            System.out.println("skippedEntity: name: " + name);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            System.out.println("ignorableWhitespace: '" + new String(ch, start, length) + "', start: " + start
                    + ", length: " + length);
            m_outputBuilder.append(ch, start, length);
        }

        @Override
        public void processingInstruction(String target, String data) throws SAXException {
            System.out.println("processingInstruction: target: " + target + ", data: " + data);
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
            System.out.println("resolveEntity");
            return null;
        }
    }
}
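
As an alternative to parsing error messages, the "own specialized schema parser" idea could start out as something like the following sketch. It reads the XSD as a plain XML document and collects the maxLength facets of anonymous simple types declared inline in an element, which is exactly the shape of the example schema. The class name `MaxLengthFacets` and the method `readInlineMaxLengths` are hypothetical. Named types, `type="..."` references, target namespaces, nested local elements with the same name, and imported schemas are deliberately not handled, so this is an illustration under those assumptions and not a general solution.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MaxLengthFacets {
    private static final String XS = "http://www.w3.org/2001/XMLSchema";

    // Builds an XPath node test for an element in the XML Schema namespace,
    // which avoids having to register a NamespaceContext.
    private static String xs(String localName) {
        return "*[local-name()='" + localName + "' and namespace-uri()='" + XS + "']";
    }

    /** Maps element name to maxLength for elements with an inline restricted simple type. */
    static Map<String, Integer> readInlineMaxLengths(InputSource xsd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document schema = factory.newDocumentBuilder().parse(xsd);

        // xs:element[@name] / xs:simpleType / xs:restriction / xs:maxLength[@value]
        String expression = "//" + xs("element") + "[@name]/" + xs("simpleType") + "/"
                + xs("restriction") + "/" + xs("maxLength") + "[@value]";
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList facets = (NodeList) xpath.evaluate(expression, schema, XPathConstants.NODESET);

        Map<String, Integer> result = new TreeMap<>();
        for (int i = 0; i < facets.getLength(); i++) {
            Element facet = (Element) facets.item(i);
            // maxLength -> restriction -> simpleType -> element
            Element element = (Element) facet.getParentNode().getParentNode().getParentNode();
            result.put(element.getAttribute("name"), Integer.parseInt(facet.getAttribute("value").trim()));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // The example schema from the question, inlined to keep the sketch self-contained.
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"a\"><xs:simpleType><xs:restriction base=\"xs:string\">"
                + "<xs:maxLength value=\"10\" /></xs:restriction></xs:simpleType></xs:element>"
                + "</xs:schema>";
        System.out.println(readInlineMaxLengths(new InputSource(new StringReader(xsd)))); // {a=10}
    }
}
```

With such a map available, a content handler could truncate the text of a matching element before it reaches the validator, instead of repairing the output after an error has been reported. As noted in the comments, real schemas can express the same restriction in many other ways, which is where this approach quickly grows into a full schema parser.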
Reto Höhener