
I recently needed to implement schema validation in Java with a schema that imports another schema (henceforth I'll refer to this as a schema hierarchy). Much to my surprise, I found that a schema hierarchy greatly complicates the code used for simple standalone schema validation. My understanding is that validating against a hierarchy of schemas requires an implementation of the LSResourceResolver interface, plus an implementation of the LSInput interface for it to return, and that is what I created.

I find this frustrating because once the validator has a handle to the root schema, any imports are simply relative to that location. Wanting to make validation easier and reusable, I proceeded to create a resolver that would ultimately simplify schema validation to two inputs for every situation.

  1. What's the root schema
  2. What's the payload you want to validate.

In other words my goal is to make something like the following work for any schema structure:

XmlValidator validator = new XmlValidator("some/dir/root.xsd");
validator.validate("<xml><someXml/></xml>");

The documentation for resolveResource, the method that is called to load resources, reveals the first issue: the resolver isn't called to load the root resource (the root schema). You need the root schema's path to be able to look up the other relative paths from it. This can be overcome by passing the root path into the resolver's constructor and tracking it manually.

Then comes the roadblock. The systemId parameter reliably contains the resource being resolved (this string is exactly the value of the import/include/redefine schemaLocation attribute). For example, if the schema currently being loaded has this line:

<xsd:include schemaLocation="../given/redefine.xsd"/>

The systemId when loading redefine.xsd will be:

"../given/redefine.xsd"

However, the baseURI parameter, which is supposed to hold the resource that was previously being loaded (and which you need, because the relative path is based on that resource's location), can be null. In my experience it is null for about two thirds of the schemas that get loaded.
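For reference, when baseURI is non-null, combining it with systemId is ordinary relative-URI resolution. A minimal sketch using java.net.URI; the class name, the method name, and the file:/ location are made up for illustration and are not part of the validation API:

```java
import java.net.URI;

public class RelativeSchemaLocation {

    // Resolves a schemaLocation value against the URI of the document that contains it.
    static URI resolve(String baseUri, String systemId) {
        return URI.create(baseUri).resolve(systemId);
    }

    public static void main(String[] args) {
        // Hypothetical location of the including schema.
        String base = "file:/sandbox/custom/wrapper.xsd";
        System.out.println(resolve(base, "../given/redefine.xsd")); // file:/sandbox/given/redefine.xsd
        System.out.println(resolve(base, "./candy.xsd"));           // file:/sandbox/custom/candy.xsd
    }
}
```

The resolution itself is the easy part; the difficulty described here is only that the base may not be supplied.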

This is the point where I feel Java's built-in validation cannot provide the solution I'm looking for. The problem we are trying to solve seems very simple: given a root schema, load all other included schemas based on the root schema's location. Unless I'm missing something, this is impossible because baseURI can be null and thus the previous schema cannot be tracked.

Surely we can't be this far along in Java's lifetime with this problem still unresolved. What am I missing here? Is it correct that it is impossible to write a validation utility that takes only the two inputs above? What are others using for schema validation? I have to believe others don't keep rolling custom resolver classes to dance around a schema hierarchy (which should be fairly common).

Here is a simple representation of the problem I am trying to solve. I am looking for the simplest, most Java-like way to solve this sample problem:

Assume sample project structure of:

src/main/java/sandbox/SchemaValidationTest.java

src/main/resources/sandbox/sample.xml
src/main/resources/sandbox/custom/wrapper.xsd
src/main/resources/sandbox/custom/candy.xsd
src/main/resources/sandbox/given/base.xsd
src/main/resources/sandbox/given/redefine.xsd

SchemaValidationTest.java:

import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.SAXException;

import com.blarg.validation.XmlValidator;


public class SchemaValidationTest {

    public SchemaValidationTest() throws Exception {
//      The linked suggested solution which fails because
//      it cannot load the first referenced schema
        Source schemaFile = new StreamSource(
                getClass().getClassLoader()
                .getResourceAsStream("sandbox/custom/wrapper.xsd"));
        Source xmlFile = new StreamSource(
                getClass().getClassLoader()
                .getResourceAsStream("sandbox/sample.xml"));
        SchemaFactory schemaFactory = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(schemaFile);
        Validator validator = schema.newValidator();
        try {
          validator.validate(xmlFile);
          System.out.println(xmlFile.getSystemId() + " is valid");
        } catch (SAXException e) {
          System.out.println(xmlFile.getSystemId() + " is NOT valid");
          System.out.println("Reason: " + e.getLocalizedMessage());
        }

//      My custom validator which succeeds all the way until
//      it reaches the candy.xsd for reasons described above and again below. 
        XmlValidator customValidator = new XmlValidator("sandbox/custom/wrapper.xsd");
        customValidator.validate(getClass().getClassLoader().getResourceAsStream("sandbox/sample.xml"));
    }

    public static void main(String[] args) throws Exception {
        new SchemaValidationTest();
    }
}

sample.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Wrapper> <!-- wrapper.xsd -->
    <GiftBasket>
        <Fruit> <!-- base.xsd -->
            <Apple>
                <Size>medium</Size>
                <Color>Red</Color> <!-- redefine.xsd -->
            </Apple>
            <Orange>
                <Size>large</Size>
            </Orange>
        </Fruit>
        <Candy> <!-- candy.xsd -->
            <Caramel>salted</Caramel>
        </Candy>
    </GiftBasket>
</Wrapper>

wrapper.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
    <xsd:include schemaLocation="../given/redefine.xsd"/>
    <xsd:include schemaLocation="./candy.xsd"/>
    <xsd:element name="Wrapper">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="GiftBasket" type="GiftBasket_Type" minOccurs="1" maxOccurs="1"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    <xsd:complexType name="GiftBasket_Type">
        <xsd:sequence>
            <!-- From base.xsd (and apple is redefined in redefine.xsd) -->
            <xsd:element name="Fruit" type="Fruit_Type" minOccurs="1" maxOccurs="1"/>
            <!-- From candy.xsd -->
            <xsd:element name="Candy" type="Candy_Type" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

base.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="unqualified" attributeFormDefault="unqualified">
    <xsd:complexType name="Fruit_Type">
        <xsd:sequence>
            <xsd:element name="Apple" type="Apple_Type" minOccurs="0" maxOccurs="unbounded" />
            <xsd:element name="Orange" type="Orange_Type" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <!-- This is redefined in redefine.xsd to include additional elements -->
    <xsd:complexType name="Apple_Type">
        <xsd:sequence>
            <xsd:element name="Size" type="xsd:string" minOccurs="0" maxOccurs="1" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Orange_Type">
        <xsd:sequence>
            <xsd:element name="Size" type="xsd:string" minOccurs="0" maxOccurs="1" />
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

redefine.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="unqualified" attributeFormDefault="unqualified">
    <xsd:redefine schemaLocation="./base.xsd">
        <xsd:complexType name="Apple_Type">
            <xsd:complexContent>
                <xsd:extension base="Apple_Type">
                    <xsd:sequence>
                        <xsd:element name="Color" type="xsd:string"/>
                    </xsd:sequence>
                </xsd:extension>
            </xsd:complexContent>
        </xsd:complexType>
    </xsd:redefine>
</xsd:schema>

candy.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
    <xsd:complexType name="Candy_Type">
        <xsd:choice>
            <xsd:element name="Chocolate" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
            <xsd:element name="Caramel" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
        </xsd:choice>
    </xsd:complexType>
</xsd:schema>

If you care to see my current implementation of LSResourceResolver, which gets me close to a solution, it is shown below. If the include for candy.xsd and the referenced element are removed from wrapper.xsd and sample.xml, this validates. It fails otherwise because when candy.xsd is being loaded, the previously loaded path was in sandbox/given and the systemId passed in is ./candy.xsd, so it looks for candy.xsd in the wrong location:

package com.blarg.validation;

import java.io.InputStream;
import java.util.LinkedList;

import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

import com.blarg.validation.exception.SchemaNotFoundException;

public class SchemaResolver implements LSResourceResolver {

    private String path;
    private ClassLoader classLoader;
    private SchemaTracker tracker;

    public SchemaResolver(String path, ClassLoader classLoader, SchemaTracker tracker) {
        this.path = path;
        this.tracker = tracker;
        this.classLoader = classLoader;
    }

    public LSInput resolveResource(String type, String namespaceURI, String publicId, String systemId, String baseURI) {
        String classloaderPath = generateClassloaderResourcePath(path, systemId);
        tracker.setLastLoadedSchema(classloaderPath);
        InputStream is = classLoader.getResourceAsStream(classloaderPath);
        if (is == null) {
            throw new SchemaNotFoundException("Loading the root schema succeeded, but the following referenced schema could not be found: '"
                    + classloaderPath
                    + "' Make sure the root schema and referenced schemas are all in the same directory. Then verify any <xsd:include>, "
                    + "<xsd:import>, or <xsd:redefine> tags all have correct 'schemaLocation' attribute values.");
        }
        /*
         * Store the last used path so the next schema lookup is relative to it.
         * This is a hack and will only work if:
         * some/dir/a.xsd imports some/dir/another/b.xsd
         * and some/dir/another/b.xsd imports some/dir/other/c.xsd
         * etc..
         * 
         * It will *not* work for:
         * some/dir/a.xsd imports some/dir/another/b.xsd
         * some/dir/another/b.xsd imports some/dir/other/c.xsd
         * etc..
         * AND
         * some/dir/a.xsd also imports some/dir/d.xsd
         * 
         * It will fail loading d.xsd because the last stored path
         * will be /some/dir/other and the systemId coming in will
         * be "./d.xsd"
         */
        path = classloaderPath.substring(0, classloaderPath.lastIndexOf("/") + 1);
        return new SchemaInput(publicId, systemId, is);
    }


    private String generateClassloaderResourcePath(String path, String systemId) {
//          fullPath may contain ./ or ../ which is not allowed in classloader resource lookups.
        String fullPath = path + systemId;

        LinkedList<String> linkedList = new LinkedList<String>();
        String current = first(fullPath);
        while (current != null) {
            if (".".equals(current)) {
//                  Do nothing, dot represents the current directory so we have it already
            } else if ("..".equals(current)) {
//                  Remove the lastly added directory because we need to go up
                linkedList.removeLast();
            } else {
//                  The directory is just a normal directory or filename, add it
                linkedList.add(current);
            }
            fullPath = removeFirst(fullPath);
            current = first(fullPath);
        }

        String classLoaderPath = "";
        while (linkedList.size() > 0) {
            classLoaderPath = classLoaderPath + linkedList.removeFirst() + "/";
        }
        classLoaderPath = classLoaderPath.substring(0, classLoaderPath.length() - 1);
        System.out.println("classLoaderPath: " + classLoaderPath);
        System.out.println();
        return classLoaderPath;
    }

    private String first(String path) {
        if (path == null) {
            return null;
        } else if (path.contains("/")) {
            return path.substring(0, path.indexOf("/"));
        } else {
            return path;
        }
    }

    private String removeFirst(String path) {
        if (path.contains("/")) {
            return path.substring(path.indexOf("/") + 1);
        } else {
            return null;
        }
    }
}

You of course need to instantiate it correctly (give it the correct path to the root schema) and register it with the schemaFactory using:

schemaFactory.setResourceResolver(new SchemaResolver(pathToSchemas, classLoader, tracker));
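The failure mode described in the comment inside resolveResource can be reproduced without any XML machinery. The sketch below compares the "last loaded directory" strategy with resolving each reference against the document that contains it. The class and method names are hypothetical, and the load order (redefine.xsd, then base.xsd, then candy.xsd) is assumed from the description above:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class LastPathTrackingDemo {

    // {including document, schemaLocation}, in the load order assumed above.
    static final String[][] REFERENCES = {
        {"sandbox/custom/wrapper.xsd", "../given/redefine.xsd"},
        {"sandbox/given/redefine.xsd", "./base.xsd"},
        {"sandbox/custom/wrapper.xsd", "./candy.xsd"},
    };

    // Strategy of the resolver above: resolve against whatever was loaded last.
    static List<String> resolveAgainstLastLoaded(String rootSchema) {
        List<String> resolved = new ArrayList<>();
        URI last = URI.create(rootSchema);
        for (String[] ref : REFERENCES) {
            last = last.resolve(ref[1]);
            resolved.add(last.toString());
        }
        return resolved;
    }

    // What is actually needed: resolve against the document containing the reference.
    static List<String> resolveAgainstIncludingDocument() {
        List<String> resolved = new ArrayList<>();
        for (String[] ref : REFERENCES) {
            resolved.add(URI.create(ref[0]).resolve(ref[1]).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(resolveAgainstLastLoaded("sandbox/custom/wrapper.xsd"));
        System.out.println(resolveAgainstIncludingDocument());
    }
}
```

The two strategies agree on redefine.xsd and base.xsd and differ only on the third entry: the first yields sandbox/given/candy.xsd, the second sandbox/custom/candy.xsd. That is why knowing the including document (the baseURI) matters.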
Russ
  • Schema validation does not require a custom resource resolver to handle `xsd:import`. – kjhughes Jan 22 '15 at 15:17
  • And what about xsd:include and xsd:redefine? I'm happy to try any code sample you provide, but I tried just loading my root schema and it could not load the referenced resources. My particular schema layout is a.xsd includes b.xsd which redefines c.xsd. – Russ Jan 22 '15 at 15:30
  • Default resource resolution should suffice. If you cannot say precisely why your needs are different, then you're probably barking up the wrong tree to be writing a custom resource resolver. Sample Java code for doing validation abounds, e.g: [What's the best way to validate an XML file against an XSD file?](http://stackoverflow.com/q/15732/290085) – kjhughes Jan 22 '15 at 16:14
  • I think you're making things far more complicated than they need to be (but I can't really tell, because I can't see what your problem was before you started digging a hole for yourself). If the baseURI passed to a uri resolver class is null, this is generally because you started the process with a Source whose systemId is null, e.g. a StreamSource that only contains an InputStream, or a DOMSource with no systemId. – Michael Kay Jan 22 '15 at 16:15
  • Another comment: trying to be helpful, but I know it might not be appreciated. If you want help from the people who are experts in this stuff, then it's quite possible that they use it successfully every day, and regard anyone who thinks it is broken as a bit of a jerk. They might well be wrong, but you want them on your side. So treat them nicely, and treat their favourite technology with respect. I'm not one of them, by the way: I have implemented these interfaces and I know their faults. But they are not irretrievably broken. – Michael Kay Jan 22 '15 at 16:22
  • Java is my primary language and is my "favourite technology". So I'm one of those people. If being critical of the language I'm going to be using every day and wanting it to be better by calling out its issues and working to resolve them on a public site like SO so other developers can benefit makes me a "jerk", then so be it. The solution kjhughes provided was the first I tried and doesn't work. Amazing how critical these comments can be without any concrete answers to the question at hand. Stick to the technical topic or save your breath. – Russ Jan 23 '15 at 13:38
  • +kjhughes, have you tried your linked solution given the constraints I already stated? I can say what's unique about my situation; can you read what's unique about it? Look at the relationship between schema a, b, and c which I've already stated. Try your solution against a schema set like that and let me know how well that solution works - because it doesn't. – Russ Jan 23 '15 at 13:43
  • Updated the question with a simple code example that depicts the problem. Perhaps @kjhughes can understand now why their linked solution does not work. Also maybe Michael Kay can see what originated the hole I started digging. I welcome the idea that Java's XML validation may not be broken and hope it isn't. I understand I might not know something about an easier way to make this work and I welcome those suggestions. I will gladly change the title of this question if it turns out Java's XML validation does handle this gracefully. Currently I don't see a way. – Russ Jan 23 '15 at 15:45
  • @kjhughes Default resolution usually works, but I also came across some situations where it didn't work as expected, especially in cases like these where schemaLocation contained "..". I also had to choose between patching the schema files and using a custom resolver. – Drunix Jan 23 '15 at 15:51
  • And in this case, modifying the schemas (and their locations) is not an option because they reside in a jar maintained by another entity I have no governance over. Their argument is "our schemas are valid, we don't need to change anything" - and they're right. So once modifying the schemas' locations is no longer an option, you are left with needing to solve the above problem. Again, I welcome better solutions than the one I've provided. If anyone is interested I can provide full code examples of what I've got today, but there is already a lot of code posted and I don't want to spam. – Russ Jan 23 '15 at 16:00
  • @Drunix, of course there are reasons to write custom resolvers, but the presence of `xsd:import`, `xsd:include`, or `xsd:redefine` is not one of them. OP: *A schema hierarchy greatly complicates the code utilized for simple standalone schema validation* is false hyperbole, as is the title of the post, *Java Schema Validation Is Broken*. Your chances of receiving help would increase if you lost the combative tone and if you provided a [**Minimal, Complete, and Verifiable Example (MCVE)**](http://stackoverflow.com/help/mcve) that exhibits your problem. Good luck. – kjhughes Jan 24 '15 at 17:26
  • Excellent advice @kjhughes - it's already there. I have the 4 tiniest schemas I could create to present the problem and already posted them on the page with the simplest project org that exhibits the problem. I have the sample java tester class using your suggested solution - which shows your suggestion fails. Text holds no tone. There is nothing combative from me on this page other than facts. Have you noticed everything you've provided so far is factually incorrect? – Russ Jan 26 '15 at 14:39
  • I have updated the title to pose it as a question and am more than happy to modify the title to anything that helps people get over their personal issues with it and focus on the issue at hand. Just give a suggestion in the comment that you feel is descriptive and I'll update it. If there is something missing from the MCVE above, let me know. I did my best to make it as small and simple as possible the schemas are all very tiny and even have a fun fruit basket structure! If you're patient and set the test up as described, you will see it fail. – Russ Jan 26 '15 at 14:51
  • @kjhughes still waiting for your solution. You seemed very knowledgeable on this topic. – Russ Jan 26 '15 at 18:57

1 Answer


Not sure if it answers your question, but here's my take.

The "Java internal schema validation solution" is, essentially, a repackaged Xerces.

So if you're asking if Xerces is broken - no, it is not.

If you are asking if it may have bugs - yes, it may have some.

What does the answer to the question "is it broken?" bring you, actually?

"No it is not" would probably contradict your experience - and it would then somehow become our task to persuade you that it is not broken.

"Yes it is" - well, fine, many things are; the question is what we do about it.

I think that the right way to go about it is to create a reproducible example and file an issue in XercesJ.

You get null baseURI in some cases? File an issue.

Relative URIs do not get resolved? File an issue.

Although there are caveats, I can't really confirm that schema validation is completely broken. I normally put all of the schemas on the classpath as resources and load the root schema from the classpath resource URI. In my experience, relatively referenced schemas are normally resolved out of the box. So I guess you're hitting some corner cases. However, the schemas I normally work with are also far from ideal. In some cases I had to use catalog resolvers to rewrite absolute URIs, but in general I got things to work in the end.
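A minimal sketch of that approach, assuming the JDK's built-in validator: the root schema is handed over as a URL rather than a bare InputStream, so the parser has a system ID to resolve relative schemaLocation values against. The class name, method names, and the two tiny schemas written to a temp directory are made up for illustration; they are not the schemas from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.SAXException;

public class RootSchemaByUrl {

    // Returns true if the XML is valid against the schema hierarchy rooted at rootSchema.
    static boolean isValid(URL rootSchema, String xml) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try {
            // newSchema(URL) carries the schema's location along with its content.
            Schema schema = factory.newSchema(rootSchema);
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the schema could not be loaded or the document is invalid
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a two-file hierarchy (custom/root.xsd includes ../given/types.xsd) and validates against it.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("schemas");
            Files.createDirectories(dir.resolve("custom"));
            Files.createDirectories(dir.resolve("given"));
            write(dir.resolve("given/types.xsd"),
                "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
                + "<xsd:complexType name='Item_Type'><xsd:sequence>"
                + "<xsd:element name='Size' type='xsd:string'/>"
                + "</xsd:sequence></xsd:complexType></xsd:schema>");
            write(dir.resolve("custom/root.xsd"),
                "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
                + "<xsd:include schemaLocation='../given/types.xsd'/>"
                + "<xsd:element name='Item' type='Item_Type'/></xsd:schema>");
            URL root = dir.resolve("custom/root.xsd").toUri().toURL();
            return isValid(root, "<Item><Size>medium</Size></Item>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(Path file, String content) throws IOException {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For classpath resources, the URL would come from getClass().getClassLoader().getResource("sandbox/custom/wrapper.xsd") instead of getResourceAsStream(...). This sketch only exercises file: URLs; how ".." behaves for schemas inside a jar is not something it verifies.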

I actually understand your pain. I've also hit a couple of corners with resolvers (but in a different environment), so it is no surprise you're frustrated. But at the end of the day, what matters is not how well you've argued that it is broken but whether you managed to fix it or not.

So good luck and stay constructive. :)

lexicore
  • This is certainly my goal - to fix it. I had 2 goals for this post: to point out that Java's schema validation is broken (the sample schemas and sample test class show that), and to provide a better way (my provided schema resolver class). The frustration you guys are sensing is frustration that I can't provide a single, better, ridiculously easy way to validate against any schema structure. I very badly want to make schema validation "ridiculously easy" for all. At the end of the day the only necessary inputs are the root schema and the payload. I badly, badly want to achieve that goal. – Russ Jan 26 '15 at 15:32
  • Are there any other XercesJ alternatives you know of? – Russ Jan 26 '15 at 15:37
  • @Russ Then I hope you can see how this does not fit the SO format. "To point out it is broken" and "to provide a better way" aren't quite on topic on SO from my PoV. As for the other question (which would also be off-topic as "recommend a tool") - I'm not really aware of any; I have only used Xerces/the built-in implementation in recent years. There was also MSV back then, and I was also using Schematron and RelaxNG apart from XSD, but this does not answer your question. At the moment I (personally) am not aware of an alternative. – lexicore Jan 26 '15 at 15:44
  • @Russ I wouldn't say ranting with anger and resorting to insults is a good strategy. If you're interested in why people react to your question the way they do, feel free to ask on [meta](http://meta.stackoverflow.com/). I'll be glad to provide my feedback if you ask for it there. – lexicore Jan 26 '15 at 19:12
  • http://meta.stackoverflow.com/questions/284598/what-is-wrong-with-the-linked-so-post – Russ Jan 26 '15 at 20:36