The ultimate goal is to prevent whitespace in a pool of validated XML content by simply not allowing bad xs:token content to pass schema validation for relevant elements. Schema-invalid instances are not allowed into the pool.

If I declare an element's type to be xsd:token in an XML Schema (1.1) and try to validate an instance document in which the xsd:token-typed element contains one or more of the disallowed characters (tab, LF, CR), or a double, leading, or trailing space, will that instance validate or not?

Assume: there's no other "restriction" (so to speak) on the content, only that it has to be an xsd:token.

Extension just to be totally clear: "The setting xs:whiteSpace=collapse means that leading and trailing whitespace is removed and internal whitespace is reduced to a single x20 character" - I understand that this is a "pre-validation / internal" (so to speak) step for the XML validator; is that right?

Michael

1 Answer

Your question reveals an incorrect assumption by talking of whitespace "restrictions". The xs:whiteSpace facet does not define restrictions; it defines normalizations, i.e. what happens to whitespace before validation is applied. In most cases whitespace is collapsed, which means that leading and trailing whitespace is removed, and internal whitespace is reduced to a single space character. If there is a pattern facet, it applies to the value after this whitespace normalization has been done.

For xs:token, note that the name of the type is highly misleading. An instance of xs:token can contain whitespace. The setting xs:whiteSpace=collapse means that leading and trailing whitespace is removed and internal whitespace is reduced to a single x20 character; the result will always be a valid instance of xs:token.

(Of course, the normalized value after validation is of interest only if you are processing the post-validation infoset, for example by using schema-aware XSLT or XQuery. If you are only doing validation to get an error if it's invalid, then xs:token and xs:string are completely equivalent.)
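To make the order of operations concrete, here is a rough Python sketch of what collapse does before the lexical check is applied. It is only an illustration: `collapse` and `in_token_lexical_space` are hypothetical helper names, not part of any validator's API, and the lexical-space test is a simplified reading of the xs:token definition.

```python
import re


def collapse(raw: str) -> str:
    """Approximate whiteSpace=collapse (hypothetical helper).

    Tab, LF and CR are first replaced by spaces, then runs of spaces
    are reduced to one and leading/trailing spaces are removed.
    """
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    return re.sub(r" +", " ", replaced).strip(" ")


def in_token_lexical_space(value: str) -> bool:
    """Simplified check of the xs:token lexical space (hypothetical helper)."""
    if re.search(r"[\t\n\r]", value):
        return False  # tab, LF or CR present
    if value != value.strip(" "):
        return False  # leading or trailing space
    return "  " not in value  # no run of two or more spaces


raw = "  hello\t\tworld \n"      # what might appear in the raw XML
normalized = collapse(raw)       # what the lexical check actually sees

print(repr(normalized))
print(in_token_lexical_space(raw), in_token_lexical_space(normalized))
```

The raw text fails the lexical check, but the validator never tests the raw text: it tests the collapsed value, which always passes. That is why an xs:token element with "bad" whitespace still validates.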

Michael Kay
  • Thanks! This - "If you are only doing validation to get an error if it's invalid, then xs:token and xs:string are completely equivalent" - is the answer I was looking for! :-) I've edited the question thanks to your clarification: "The setting xs:whiteSpace=collapse means that leading and trailing whitespace is removed and internal whitespace is reduced to a single x20 character". – Michael Oct 19 '16 at 07:24
  • I should qualify the statement that "xs:token and xs:string are completely equivalent." This isn't true if you want to create a type derived by restriction using a pattern facet. The pattern facet applies to the value AFTER whitespace normalization, so the same pattern will give different effects in the two cases. – Michael Kay Oct 19 '16 at 11:50
  • Thank you Mr. Kay - that is a very helpful answer for my other use case(s). – Michael Oct 19 '16 at 15:08
  • I was looking at the Schema spec again today (as you do), and noticed this bit of the definition of [`xs:token`](https://www.w3.org/TR/xmlschema-2/#token): > "The ·lexical space· of token is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces." Doesn't this mean those characters are invalid before we get to normalization ([cvc-datatype-valid](https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html#cvc-datatype-valid))? – mrg Mar 03 '21 at 10:33
  • You missed the fact that the "lexical space" is not what's in the raw XML, it's what you get after applying "pre-lexical" normalisations - which essentially means `xs:whiteSpace`. – Michael Kay Mar 03 '21 at 12:43
  • Thanks. I couldn't find anywhere in Schema Part 2 that says this explicitly, but it makes sense, and the specification for base64Binary assumes it: "Any string compatible with the RFC can occur in an element or attribute validated by this type, because the ·whiteSpace· facet of this type is fixed to collapse, which means that all leading and trailing whitespace will be stripped, and all internal whitespace collapsed to single space characters, before the above grammar is enforced" https://www.w3.org/TR/xmlschema-2/#base64Binary – mrg Mar 06 '21 at 01:29
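
As a rough illustration of the point made in the comments about pattern facets, the following Python sketch applies the same pattern to the same raw text under the two whiteSpace settings (preserve for a restriction of xs:string, collapse for a restriction of xs:token). The helper names and the sample pattern are assumptions made for this example, and Python's `re` module only approximates XSD regular expressions; `re.fullmatch` is used because XSD patterns match the whole value.

```python
import re


def preserve(raw: str) -> str:
    """whiteSpace=preserve, as for xs:string: the value is left untouched."""
    return raw


def collapse(raw: str) -> str:
    """Approximate whiteSpace=collapse, as for xs:token (hypothetical helper)."""
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    return re.sub(r" +", " ", replaced).strip(" ")


def satisfies_pattern(raw: str, normalize, pattern: str) -> bool:
    """Normalize first, then test the pattern against the whole value."""
    return re.fullmatch(pattern, normalize(raw)) is not None


PATTERN = r"[a-z]+ [a-z]+"   # sample pattern: two words, one space
raw = "  foo \t bar\n"

print(satisfies_pattern(raw, preserve, PATTERN))  # restriction of xs:string
print(satisfies_pattern(raw, collapse, PATTERN))  # restriction of xs:token
```

Under these assumptions the pattern rejects the raw text when the base type preserves whitespace and accepts it when the base type collapses whitespace, so a pattern on an xs:string restriction is the one that can actually reject stray whitespace.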