I want to write an XSD that restricts the content of valid XML elements of type xsd:token such that, at validation, they would be indistinguishable from the same content wrapped in xsd:string.

I.e. they must not contain carriage return (#xD), line feed (#xA), or tab (#x9) characters, must not begin or end with a space (#x20) character, and must not include a sequence of two or more adjacent space characters.

I think the regular expression to use is this:

\S+( \S+)*

(one or more non-whitespace characters, followed by zero or more groups of a single space plus one or more non-whitespace characters, so the value always ends in non-whitespace)

This works with various regex testing tools, but I can't get it to take effect in oXygen XML Editor: strings with double spaces, leading and trailing spaces, tabs, and line breaks still let the XML instance pass validation.

Here's the XSD implementation:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:token">
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
    <xs:pattern value="\S+( \S+)*"/>
  </xs:restriction>
</xs:simpleType>

Is there some feature of

  • XML

or

  • XSD

or

  • oXygen XML Editor

that prevents this working?

Michael
  • If you are using a non XML Schema regex, you need `^\S+(\s\S+)*$`. In an XML Schema regex, the anchors are not necessary - `\S+(\s\S+)*` – Wiktor Stribiżew Oct 31 '16 at 20:22
  • Thanks, it's in the context of XSD validation so I used the usual XML Schema syntax without ^ and $. Can you see why my longer version above does not work? And how would `\S+(\s\S+)*` exclude e.g. line breaks and tabs? The `\s` includes both `\n` and `\t` – Michael Nov 01 '16 at 07:55
  • Hi @WiktorStribiżew - I think that this is the regex I need, thanks for making it less verbose: `\S+( \S+)*` - please note the single deliberate literal "space" character. – Michael Nov 01 '16 at 08:09
  • Hello @WiktorStribiżew - I like the less verbose regex, but my original question was not focussed on the (working) regex but on the failing XSD implementation of it. That problem persists with the new regex. – Michael Nov 01 '16 at 08:12
  • xsd:token cannot contain any spaces. tsd:tokens (plural), but not xsd:token. – cco Nov 01 '16 at 08:50
  • Yes, I meant to write `\S+( \S+)*`, but I was too distracted by my children. I will post then. – Wiktor Stribiżew Nov 01 '16 at 09:01
  • I plus-oned because it was confirmation of what I thought, not only because I definitely know the feeling. – Michael Nov 01 '16 at 13:10

2 Answers

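The base type needs to be xsd:string.

With xsd:token as the base, the processor first tokenizes the input (whiteSpace is collapse: tabs and line breaks become spaces, runs of spaces are collapsed, and leading and trailing spaces are stripped) and only THEN checks the pattern. The pattern therefore only ever sees an already-tokenized value and can never fail on whitespace, which makes it redundant.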
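
To make the order of operations concrete, here is a minimal sketch that contrasts the two base types. The type names `CollapsedType` and `PreservedType` and the element names `collapsed` and `preserved` are made up for illustration; the only meaningful difference between the two types is the `base` attribute.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: xs:token has whiteSpace set to "collapse", so
       whitespace is normalized first and the pattern is checked against
       the already-normalized value. It cannot reject stray whitespace. -->
  <xs:simpleType name="CollapsedType">
    <xs:restriction base="xs:token">
      <xs:pattern value="\S+( \S+)*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical type: xs:string has whiteSpace set to "preserve", so
       the pattern is checked against the text exactly as written. -->
  <xs:simpleType name="PreservedType">
    <xs:restriction base="xs:string">
      <xs:pattern value="\S+( \S+)*"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="collapsed" type="CollapsedType"/>
  <xs:element name="preserved" type="PreservedType"/>
</xs:schema>
```

Under these assumptions, `<collapsed>  a   b  </collapsed>` should validate, because the value is normalized to `a b` before the pattern is applied, while `<preserved>  a   b  </preserved>` should be rejected.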

Michael

Your original ([^\s])+( [^\s]+)*([^\s])* regex contains some redundant patterns: it matches and captures each iteration of 1+ non-whitespaces, then matches 0+ sequences of space and 1+ non-whitespaces, and then again tries to match and capture each iteration of a non-whitespace.

You may use a similar, but shorter

\S+( \S+)*

Since XML Schema regex is anchored by default, the expression matches:

  • \S+ - one or more chars other than whitespace, specifically &#x20; (space), \t (tab), \n (newline) and \r (return)
  • ( \S+)* - zero or more sequences of a single space followed by 1+ non-whitespace chars.

This expression disallows consecutive spaces as well as spaces at the leading or trailing position.

Here is how the regex should be used:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:string">
    <xs:pattern value="\S+( \S+)*"/>
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>
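
As a rough check, a hypothetical instance fragment like the one below could be used to exercise the type. The element names `items` and `item` are assumptions for illustration, with `item` assumed to be declared with type `Tokenized500Type` as defined above (base xs:string); the comments state the outcome I would expect from a conforming validator.

```xml
<!-- Hypothetical instance; <items> and <item> are assumed names, with
     <item> declared as type Tokenized500Type (base xs:string). -->
<items>
  <item>one two three</item>  <!-- expected valid: single spaces only -->
  <item>one  two</item>       <!-- expected invalid: two adjacent spaces -->
  <item> one two</item>       <!-- expected invalid: leading space -->
  <item>one two </item>       <!-- expected invalid: trailing space -->
  <item>one&#x9;two</item>    <!-- expected invalid: tab character -->
  <item></item>               <!-- expected invalid: empty value -->
</items>
```

If the base is switched back to xs:token, all of these except the empty one should pass, since the whitespace is collapsed before the pattern is evaluated.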
Wiktor Stribiżew
  • I tried this but for some reason the "live" regex in the XSD (now also in the question) just ignores this... ohhhhhhhhhhhhh.... because it's tokenized and THEN checked if it's a token (hint: it's been tokenized)?! – Michael Nov 01 '16 at 13:14
  • I added the code how it should be used into the answer. – Wiktor Stribiżew Nov 01 '16 at 13:20