3

I'd like to generate a (series of) regexp(s) from a numeric range.

Example:

1013 - 4044 => 

regexp                      matches
---------------------------------------
101[3-9]                    1013 - 1019
10[2-9][0-9]                1020 - 1099
11[0-9][0-9]                1100 - 1199
[23][0-9][0-9][0-9]         2000 - 3999
40[0-3][0-9]                4000 - 4039
404[0-4]                    4040 - 4044

what is the simplest algorithm?

What is the easiest way to reverse it (i.e. given the regexps, looking for the ranges)?

Would be nice to see solutions in java, clojure, perl...

Thanks!

Wrikken
  • 69,272
  • 8
  • 97
  • 136
Matias
  • 31
  • 1
  • 2
  • 2
    Just five hours ago, someone posted a great answer on what can and should be done with regexes and what shouldn't: http://stackoverflow.com/questions/4098086/to-use-or-not-to-use-regular-expressions/4098123#4098123 (Spoiler: This falls into the latter category) –  Nov 04 '10 at 20:34
  • ^ Like delnan says, plus you forgot the `1200-1999` range. You're better off plucking out all integer sequences from a textblob, and process them further on with something else. – Wrikken Nov 04 '10 at 20:44

1 Answers1

2

There is an online tool for generating regex given a range, and provides the explanation. You can find the source code there also. For example:

^(101[3-9]|10[2-9][0-9]|1[1-9][0-9]{2}|[23][0-9]{3}|40[0-3][0-9]|404[0-4])$
First, break into equal length ranges:
  1013 - 4044

Second, break into ranges that yield simple regexes:
  1013 - 1019
  1020 - 1099
  1100 - 1999
  2000 - 3999
  4000 - 4039
  4040 - 4044

Turn each range into a regex:
  101[3-9]
  10[2-9][0-9]
  1[1-9][0-9]{2}
  [23][0-9]{3}
  40[0-3][0-9]
  404[0-4]

Collapse adjacent powers of 10:
  101[3-9]
  10[2-9][0-9]
  1[1-9][0-9]{2}
  [23][0-9]{3}
  40[0-3][0-9]
  404[0-4]

Combining the regexes above yields:
  (101[3-9]|10[2-9][0-9]|1[1-9][0-9]{2}|[23][0-9]{3}|40[0-3][0-9]|404[0-4])

Next we'll try factoring out common prefixes using a tree:
Parse into tree based on regex prefixes:
  . 1 0 1 [3-9]
      + [2-9] [0-9]
    + [1-9] [0-9]{2}
  + [23] [0-9]{3}
  + 4 0 [0-3] [0-9]
      + 4 [0-4]

Turning the parse tree into a regex yields:
  (1(0(1[3-9]|[2-9][0-9])|[1-9][0-9]{2})|[23][0-9]{3}|40([0-3][0-9]|4[0-4]))

We choose the shorter one as our result.

^(101[3-9]|10[2-9][0-9]|1[1-9][0-9]{2}|[23][0-9]{3}|40[0-3][0-9]|404[0-4])$

To reverse it, you can look at the character classes, and get the minimum and maximum for each alternative.

   ^(101[3-9]|10[2-9][0-9]|1[1-9][0-9]{2}|[23][0-9]{3}|40[0-3][0-9]|404[0-4])$
=>   1013     1020         1100            2000        4000         4040     lowers
        1019         1999        1199         3999            4039     4044  uppers

=> 1013 - 4044
kennytm
  • 510,854
  • 105
  • 1,084
  • 1,005
  • 2
    online tool require authentication, do you have any ? – Gaurav Shah Dec 16 '15 at 09:24
  • Could not get access to that online tool too, it requires authentication... But according to this [script](http://code.activestate.com/recipes/534137-generate-a-regular-expression-to-match-an-arbitrar/), this is the same as it was supposed to be available in the referenced tool – herrera Jul 10 '16 at 04:47