I'm trying to validate a restricted string using a regular expression ...

<xs:simpleType name="myStringType">
    <xs:restriction base="xs:string">
        <xs:pattern value="^urn:mystuff:v1:(ABC\.(?!Acme).\S+\.\S+\.a\d+\.v\d+|ABC\.Acme\.\S+\.a\d+\.\d+\.\d+)$"/>
    </xs:restriction>
</xs:simpleType>

As you can see the regular expression I'm trying to use is

^urn:mystuff:v1:(ABC\.(?!Acme).\S+\.\S+\.a\d+\.v\d+|ABC\.Acme\.\S+\.a\d+\.\d+\.\d+)$

I would like the following to validate:

urn:mystuff:v1:ABC.Test.MyData.a1.v1
urn:mystuff:v1:ABC.Acme.MyData.a1.0.1

But I would like the following to fail

urn:mystuff:v1:ABC.Acme.MyData.a1.v1

This appears to work fine in an online regex tester but when I use Oxygen XML Editor I get the following error.

 Pattern value '^urn:mystuff:v1:(ABC\.(?!Acme).\S+\.\S+\.a\d+\.v\d+|ABC\.Acme\.\S+\.a\d+\.\d+\.\d+)$' is not a valid regular expression. The reported error was: 'This expression is not supported in the current option setting.'.

This post suggests that lookaheads and lookbehinds are not supported in XSD regex but the question relates to number patterns so a brute force approach is taken in the example. This is possible because there's a very limited subset of possibilities.

How does one deal with this when the value to be disallowed is a specific string?

agf1997
  • To clear things up a bit, the dot in this sequence `(?!Acme).\S+` is a literal or a metacharacter ? Or, is it a typo that shouldn't even be there ? –  Dec 14 '19 at 16:49
  • Literal. The examples show the pattern – agf1997 Dec 14 '19 at 16:53
  • @x15 did you delete your answer? – agf1997 Dec 14 '19 at 17:15
  • @x15 ah. Thanks. Went to go give it a try and poof. Thanks for thinking about it – agf1997 Dec 14 '19 at 17:21
  • This one works... –  Dec 14 '19 at 17:33
  • Did you try this one out ? –  Dec 14 '19 at 17:42
  • @kjhughes this question has been revised to clarify and differentiate. I also specifically noted how the referenced answer did not answer the question. Clearly people agree and have something to add. Please reopen. – agf1997 Dec 14 '19 at 18:25
  • @agf1997: Alright. Reopened. Sorry those didn't help. – kjhughes Dec 14 '19 at 18:27
  • **Related** (migrated from former close heading): (a) [XSD restriction that negates a matching string](https://stackoverflow.com/q/9889206/290085) (b) [XML schema restriction pattern for not allowing specific string](https://stackoverflow.com/q/37563199/290085) (c) [XML Regex - Negative match](https://stackoverflow.com/q/38436165/290085) – kjhughes Dec 14 '19 at 18:31
  • This specific focused question with regex is extremely complex. I don't even like to think of it, although I have many times. And I consider myself an expert. This question happens to have a solution. –  Dec 14 '19 at 18:35
  • **_Note this word of caution ;_** It would be a stretch to assume that these assertion simulations are _covered ground_. I would consider each one a uniquely extraordinary question and answer ! –  Dec 15 '19 at 19:29
  • Agree. The other questions had useful tidbits but didn’t really provide a workable answer for this question as it has its own unique challenges. – agf1997 Dec 15 '19 at 19:31

3 Answers

XSD has its own definition of what it accepts in a regular expression, and it is rather more restrictive than many other regex dialects. I think the designers intended to use a "common subset" of popular regex dialects so that it could be easily implemented on any platform. You are using constructs like (?! ... ) and (?: ... ) that aren't defined in this subset. So does the answer from @x15, unfortunately.

Telling you why your attempt isn't working is easy; finding an alternative that does work is harder. I would go for the easy option, which is to use an XSD 1.1 assertion like `test="matches($value, XX) or matches($value, YY) and not(matches($value, ZZ))"`. A solution using pure XSD 1.0 might be possible, but I can't immediately see it.
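
A minimal sketch of what such an assertion could look like, assuming an XSD 1.1 processor. The three regular expressions are my own stand-ins for XX, YY and ZZ, derived from the examples in the question, and the "not" test simply mirrors the intent of `(?!Acme)`. Unlike `xs:pattern`, the XPath `matches()` function is not implicitly anchored, so `^` and `$` are written out.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- Requires XSD 1.1: xs:assertion is not available in XSD 1.0. -->
    <xs:simpleType name="myStringType">
        <xs:restriction base="xs:string">
            <!-- (general form AND NOT an Acme name) OR the Acme-specific form.
                 The regexes are illustrative assumptions, not taken verbatim
                 from the question. -->
            <xs:assertion test="
                (matches($value, '^urn:mystuff:v1:ABC\.\S+\.\S+\.a\d+\.v\d+$')
                    and not(matches($value, '^urn:mystuff:v1:ABC\.Acme')))
                or matches($value, '^urn:mystuff:v1:ABC\.Acme\.\S+\.a\d+\.\d+\.\d+$')"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>
```

The parentheses make the grouping explicit, although `and` already binds more tightly than `or` in XPath.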

Michael Kay

Addendum: Note that this solution plants a pseudo-assertion at a fixed location in the string.
For an example of an assertion that should span the entire string,
see the question "XML schema restriction pattern for not allowing specific string".


Edit: As pointed out in a comment, use (..) instead of (?:..) if that is the only
supported construct. Changed below.


The sequence (?!Acme)\S+\. can be replaced with this longer alternation:

([^A]\S*|A([^c.]\S*)?|Ac([^m.]\S*)?|Acm([^e.]\S*)?)\.

It is bigger, but it should cover all cases. The regex then becomes:

urn:mystuff:v1:(ABC\.([^A]\S*|A([^c.]\S*)?|Ac([^m.]\S*)?|Acm([^e.]\S*)?)\.\S+\.a\d+\.v\d+|ABC\.Acme\.\S+\.a\d+\.\d+\.\d+)

https://regex101.com/r/qXv9HU/2

Expanded

 urn:mystuff:v1:
 (                             # (1 start)
      ABC \. 
      (                             # (2 start)
           [^A]  \S* 
        |  A 
           ( [^c.] \S* )?                # (3)
        |  Ac 
           ( [^m.] \S* )?                # (4)
        |  Acm  
           ( [^e.] \S* )?                # (5)
      )                             # (2 end)
      \. 
      \S+ \. a \d+ \. v \d+ 
   |  
      ABC \. Acme \. \S+ \. a \d+ \. \d+ \. \d+ 
 )                             # (1 end)
0

The simplest way would be to exploit this rule in the XML Schema specification:

If multiple <pattern> element information items appear as children of a <simpleType>, the values should be combined as if they appeared in a single regular expression as separate branches.

Note: It is a consequence of the schema representation constraint Multiple patterns (§4.3.4.3) and of the rules for restriction that pattern facets specified on the same step in a type derivation are ORed together, while pattern facets specified on different steps of a type derivation are ANDed together.

Instead of trying to match both allowed patterns with a single regex, specify two separate pattern facets. That would also extend more naturally if a third or fourth URN pattern is required.
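
As a rough sketch, the two facets might look like the following. Each branch still has to avoid lookaheads, so I am assuming the lookahead-free expansion given in another answer here for the non-Acme branch. XSD patterns are implicitly anchored, so `^` and `$` are left out.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:simpleType name="myStringType">
        <xs:restriction base="xs:string">
            <!-- Two pattern facets on the same derivation step are ORed. -->
            <!-- Branch 1: any name not starting with "Acme", ending in aN.vN -->
            <xs:pattern value="urn:mystuff:v1:ABC\.([^A]\S*|A([^c.]\S*)?|Ac([^m.]\S*)?|Acm([^e.]\S*)?)\.\S+\.a\d+\.v\d+"/>
            <!-- Branch 2: the Acme name, ending in aN.N.N -->
            <xs:pattern value="urn:mystuff:v1:ABC\.Acme\.\S+\.a\d+\.\d+\.\d+"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>
```

With these two facets, `urn:mystuff:v1:ABC.Acme.MyData.a1.v1` matches neither branch, so it is rejected without any explicit negation.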

kimbert
  • That handles AND and OR, but it doesn't immediately provide a way of doing AND NOT. – Michael Kay Dec 14 '19 at 18:09
  • On further reflection, I believe my suggested approach would work for the scenarios described in the question. The string `urn:mystuff:v1:ABC.Acme.MyData.a1.v1` will not match either of the regexes, so the 'AND NOT' is not required. Unless I'm missing something. – kimbert Dec 16 '19 at 09:46
  • One of the 2 regex used an assertion `(?!Acme)` at a location and was in error via _unsupported_ construct. The NOT is needed because a specific item needs to be not. Together all these conditions must all be true **(** `urn:mystuff:v1:ABC\.\S+\.\S+\.a\d+\.v\d+` AND NOT `urn:mystuff:v1:ABC\.Acme\S*\.\S+\.a\d+\.v\d+` **)** OR **(** `urn:mystuff:v1:ABC\.Acme\.\S+\.a\d+\.\d+\.\d+` **)** –  Dec 16 '19 at 20:46