
Assume the following:

  1. I have a set of XSD schemas S, each with distinct namespace URIs.

  2. I know that I'm going to be receiving an XML document containing a root element that contains exactly one namespace declaration that refers to a member of S. I can abort parsing immediately with an error if I don't receive exactly one namespace declaration, or if the received namespace doesn't refer to any schema in S.

I want to parse the incoming XML document with a SAX parser and validate it during parsing against one of the schemas in S. From the above, I know that one of the first calls I'll see in the ContentHandler that I give to the parser (after setDocumentLocator and startDocument) will be startPrefixMapping, which is delivered just before the root element's startElement when the parser encounters the namespace declaration.

Is it possible to, in the startPrefixMapping call, pick one of the schemas in S for validation once I know which one I need?

It seems that I could maybe call setSchema on the parser inside the startPrefixMapping call, but I get the feeling from the API documentation that I'm not supposed to do this (and that it may be too late to call the method at that point anyway).

Is there some other way to supply a set of schemas to the parser and perhaps have it pick the right one itself based on the namespace declaration it receives?

Edit: I was wrong. Calling setSchema on a parser once parsing has started isn't just inadvisable, it's impossible. Parsers don't expose a setSchema call; only parser factories do. This means that my options are limited to those that let the parser select a schema for itself. Unfortunately, that has its own problems. An XML document can't merely specify a namespace; it also has to specify a filename for the intended schema (which in my opinion is an implementation detail on the parser side and should not be required of the incoming data), and the parser has to intercept the request for this filename to supply a member of S for validation.
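
Because setSchema lives on the factory, one factory-level option that may sidestep the filename problem is to compile every member of S into a single Schema with `SchemaFactory.newSchema(Source[])`. The validator then looks up the root element's declaration by its namespace-qualified name. The following is only a minimal sketch of that idea, not the approach in the example code linked further down: the two inline schemas, their `urn:example:*` namespaces, the systemId strings, and the class and method names are all hypothetical stand-ins for S. It also does not enforce the "exactly one namespace declaration" rule, which would still need a check in startPrefixMapping.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class CombinedSchemaExample {

    // Hypothetical stand-ins for two members of S, each with its own target namespace.
    static final String SCHEMA_A =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:a' elementFormDefault='qualified'>"
        + "<xs:element name='a' type='xs:int'/></xs:schema>";

    static final String SCHEMA_B =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:b' elementFormDefault='qualified'>"
        + "<xs:element name='b' type='xs:string'/></xs:schema>";

    // Compile all schemas in S into one Schema object. The systemIds are
    // made-up labels; the schema text itself comes from the readers.
    static Schema combinedSchema() throws SAXException {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Source[] sources = {
            new StreamSource(new StringReader(SCHEMA_A), "urn:example:a.xsd"),
            new StreamSource(new StringReader(SCHEMA_B), "urn:example:b.xsd")
        };
        return sf.newSchema(sources);
    }

    // Parse with a validating, namespace-aware SAX parser and report
    // whether the document passed validation.
    static boolean isValid(String xml) throws Exception {
        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        pf.setSchema(combinedSchema());
        SAXParser parser = pf.newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                // Validation errors arrive here; DefaultHandler ignores them
                // by default, so rethrow to abort the parse.
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Known namespace, valid content.
        System.out.println(isValid("<a xmlns='urn:example:a'>1</a>"));
        // Known namespace, content that is not an xs:int.
        System.out.println(isValid("<a xmlns='urn:example:a'>x</a>"));
        // Namespace that matches no schema in the combined set.
        System.out.println(isValid("<c xmlns='urn:example:c'/>"));
    }
}
```

With this arrangement the document only has to declare the right namespace. A root element whose namespace matches none of the compiled schemas has no declaration to validate against, so the validator reports an error for it.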

Edit: I've solved this. I've put together some heavily commented public domain example code (linked below) that looks up schemas based on pre-specified systemIds and delivers them programmatically (so they can be served from databases, class resources, etc.). It correctly rejects any document that specifies an unknown schema, specifies no schema, or supplies its own schemaLocation to try to fool the validator.

https://github.com/io7m/xml-schema-lookup-example

oceanic
  • Why not leverage the standard `schemaLocation` or `noNamespaceSchemaLocation` mechanisms to make the association rather than try to recreate such a mechanism yourself? That way, the SAX parser will automatically make the association for you. See duplicate link for details. – kjhughes Dec 05 '17 at 17:52
  • That may be an option, but it seems slightly unpleasant: The XML document would be specifying a filename and the parser would be responsible for resolving that filename. Aside from the fact that the schemas are not accessible via any filesystem (they're stored as resources in a jar file), this means that the incoming document has to specify _both_ the correct namespace _and_ the correct filename (in order for a resolver given to the SAX parser to correctly find a schema). – oceanic Dec 05 '17 at 19:44
  • Consider too [OASIS XML Catalogs](https://www.oasis-open.org/committees/download.php/14810/xml-catalogs.pdf) and [entity resolvers](http://www.saxproject.org/apidoc/org/xml/sax/EntityResolver.html). – kjhughes Dec 05 '17 at 20:20
  • Thanks, I think I probably have enough to work with here. I'll accept this as the answer, but I'm not exactly sure if it'll fit everything I need to do yet. – oceanic Dec 05 '17 at 20:25
  • Glad to help. Once you're underway, feel free to post new, more specific questions as they arise. Good luck. – kjhughes Dec 05 '17 at 20:32
  • One reason to set the schema in a call to the validator instead of in the XML document itself is that in cases where XML data is crossing trust boundaries and needs to be validated, it would be self-defeating to take for granted that the input is pointing to the schema you want to validate it against. If an adversarial data source is trying to sneak data past you that doesn't conform to the agreement they made with you, you don't want their task to be as simple as pointing to a different set of schema documents ... – C. M. Sperberg-McQueen Dec 06 '17 at 02:17
  • C.M: I'm not sure which one of us you're responding to, but I agree. I have incoming data which can only be one of several types defined by distinct schemas. Any data that doesn't immediately declare itself to be of one of those types must be immediately rejected. The sole problem here is that I need to pick a schema once I see which schema that data is claiming to conform to. Having experimented with catalogs and entity resolvers, I don't see how those can help me do this at all. – oceanic Dec 06 '17 at 12:32
  • @kjhughes: Is there any chance you could remove the "duplicate" tag? I don't believe this question is a duplicate of the linked question. They are similar but not the same: The other question is asking how the author of a document can associate a schema with that document. My question is asking how the *parser* of a document can associate the document with a schema based on the namespace alone. – oceanic Dec 06 '17 at 15:09

0 Answers