Regular Expression for set notation style range validators

Question

I am implementing a multiple range validator that validates the user input against multiple ranges at once.

Example

If I define the range [1.5, 4] | [6.9, 9.3) | [10, 11], it would read like: from 1 to 4, or from 6 to 9 not inclusive. In this case the numbers 5, 9.3, 13 would not be included in this range, but the numbers 2, 7 and 10 would be.

The expression can have as many ranges as possible formatted in the way:

[ or (

Float or Integer

,

Float or Integer

] or )

Expression Syntax

Essentially, this range expression is made of:

'[' inclusive start of range

'] inclusive end of range

'(' exlusive start of range

')' exclusive end of range

',' value separator

'|' OR statement

Now, I need a way to validate that the user has written a correct range expression. If the parser reads an incorrectly formatted expression, the parser will throw an exception. I would like to create a regular expression to very elegantly validate the user inputted expression. My experience with regular expressions is not that broad and this seems like a pretty hard one to craft. I would love some guidance on how to construct a complicated regular expression like this that has integers, doubles, '[', etc.

Although regular expressions are platform agnostic, I am working with .NET.

Just a thought - I tend to dislike "don't do that" as an answer to a question - but: Since you say "[i]f the parser reads an incorrectly formatted expression, [it] will throw an exception" - The optimal way IMO, if at all possible, would be to let the parser do its job, catch the exception and use that as a validation result, rather than do your own regex, which may actually end up handing something to the parser that will fail anyway. Having said that, I'll have a look (if someone doesn't beat me to it). — JimmiTh, Jul 12 '13 at 15:10
@JimmiTh - I think the OP is basically wanting to "compile" the expression into a regex. Maybe I'm reading it wrong? — JDB, Jul 12 '13 at 15:23
If I understand correctly, then what you are trying to do will result in a horribly ugly and ineffecient regex. See http://stackoverflow.com/q/17604016/211627. Also, regexes are platform agnostic like C++ is platform agnostic... stray too far from "Hello world" and you quickly run into platform-specific implementations. — JDB, Jul 12 '13 at 15:25
http://stackoverflow.com/questions/4098086/to-use-or-not-to-use-regular-expressions/4098123 may be informative. — Dour High Arch, Jul 12 '13 at 19:20
Another vote for "let the parser handle its own validation". If you're writing the parser yourself, and this /is/ the validation -- then @JimmiTh's answer is what you're looking for. Otherwise, just catch the exception. — zebediah49, Jul 12 '13 at 19:21

JimmiTh · Accepted Answer · 2013-07-12T16:34:50.347

A quick shot, free-spacing, likely far from efficient (first of all because a single range will make the regex backtrack when it finds out there's not a '|' after it) - see version 2 below, which is (I believe) much more efficient:

^                     # Start of string
(?:                   # Start group, non-capturing
  [([]                # '(' or '['
    \s*               # optional whitespace
    [0-9]+            # at least one digit 0-9
    (?:\.[0-9]+)?     # optionally '.' followed by at least one digit 0-9
    \s*               # optional whitespace
  ,                   # ','
    \s*               # optional whitespace
    [0-9]+            # at least one digit 0-9
    (?:\.[0-9]+)?     # optionally '.' followed by at least one digit 0-9
    \s*               # optional whitespace
  [)\]]               # ')' or ']'

  \s*                 # optional whitespace
  \|                  # '|'
  \s*                 # optional whitespace
)*                    # all the above may appear 0 or more times

[([]                  # The remainder is exactly the same as the group above, 
 \s*                  # used for a single range or the last range - 
  [0-9]+              # i.e., a range NOT followed by '|' - of a multi range.
  (?:\.[0-9]+)?
  \s*
,
  \s*
  [0-9]+
  (?:\.[0-9]+)?
  \s*
[)\]]
$                     # end of string

This will match e.g.:

[1.5, 3]
[23.7, 3.70)
[2.9 , 3]|[3,2)
[1.5, 4] | [6.9, 9.3) | [10, 11]
(1.5, 3]
[23.7, 3.70)
(1.5, 5.0)

But not:

[23.7, 3.70) | (7, 9) |     // trailing OR
| [23.7, 3.7]               // leading OR

Note that it doesn't ensure that the second number is actually higher than the first. For that, I really recommend leaving it to the/a parser - or add capturing groups and process them outside regex.

VERSION 2

This should be more efficient due to less backtracking - it's basically changed from:

(any number of ranges followed by |) followed by a range

... to:

a range followed by (any number of ranges preceded by |)

ETA: To explain, version 1 starts out checking for "a range followed by |".

If we only have a single range, that's wasted time. When it gets to the "|" it will start over, checking the second part of the regex - i.e. is there, instead, the required "range without |"?

In version 2, instead, we start out checking for simply "a range". That means, if there's only one range, it will succeed without any backtracking. If we give it gibberish, e.g. hello, it will fail immediately, because it now knows that the first character must be a ( or [ - it's not optional. Whereas in version 1, because the first part was optional, it would then have to check the second part of the regex to make sure that failed too.

In just about every other case (that I've tested) version 2 matches - or fails to match - in fewer steps.

Here, since it's basically the same regex with some parts switched, I'll instead put an example match in the comments:

^
[([]                 # (
  \s*                #
  [0-9]+             # 3
  (?:\.[0-9]+)?      # .90
  \s*                #
,                    # ,
  \s*                #
  [0-9]+             # 43
  (?:\.[0-9]+)?      # .2
  \s*                #
[)\]]                # ]
                     #
(?:                  #
  \s*                #
  \|                 # |
  \s*                #
                     #
  [([]               # [
    \s*              #
    [0-9]+           # 55
    (?:\.[0-9]+)?    # .20
    \s*              #
  ,                  # ,
    \s*              #
    [0-9]+           # 2
    (?:\.[0-9]+)?    # .91
    \s*              #
  [)\]]              # )
)*
$

Matches and non-matches should be identical to version 1.

This is perfect. My intent was not to let the regex parse it but let the regex do a validation of the expression. The parsing happens later. — Alexander Ventura, Jul 12 '13 at 15:32
Also this is a solid education material to my learning of regex. — Alexander Ventura, Jul 12 '13 at 15:35
Yeah, my point in the comment to the question was really that you should leave as much as possible to the actual parser - possibly even more than this regex does. In the same way that, IMO, the optimal regex email validation is something that checks that there's *something* followed by a `@`, followed by *something*, optionally with a `.` in it - and checks nothing else. I.e. `^.+@.+\..+$` - or even just `^.+@.+$` — JimmiTh, Jul 12 '13 at 15:37
Well, my intent for this regex was to have it validate the expression. Once validated, I would not have to worry about validating the input while parsing. — Alexander Ventura, Jul 12 '13 at 15:39
Added a version 2 - kept version 1 in for learning purposes. :-) — JimmiTh, Jul 12 '13 at 15:56

score 1 · Answer 2 · answered Jul 12 '13 at 15:12

1

If you choice to do it with a regex, you must describe all the possibilities separated with |. Here an example for the first range (you can easily add values for the other ranges):

 @"1\.[5-9]|[23](?:\.[0-9])?|4"

But the most elegant way IMO, is to extract all numbers and test them as numeric after.

answered Jul 12 '13 at 15:12

Casimir et Hippolyte

88,009
5
94
125

How would this work if I had a variable number of ranges? I want to create this so any number of ranges will be okay. – Alexander Ventura Jul 12 '13 at 15:16
I believe the question asks for regex to test "is the input a valid specifier", as opposed to "given a specifier, how can I use regex to test if a given input is within it". – zebediah49 Jul 12 '13 at 19:20

Regular Expression for set notation style range validators

2 Answers2