9

This may be a problem that has already been solved, but I can't figure it out. I have two largish integers, lets call them start_number and end_number (they represent a contiguous block of telephone numbers). Other numbers (represented as strings) will be input into my system and I need to use regex to match this against the 'range regex' to see if the number string falls either on or between start_number and end_number.

For example:

  • start_number = 99519000
  • end_number = 99519099

therefore

  • expression = "^995190[0-9][0-9]$"

so that I can eventually match the following example numbers (which arrive at my system one at a time, and could arrive at any time):

  • "99519000" <- MATCH
  • "99519055" <- MATCH
  • "99519099" <- MATCH
  • "99519100" <- NOT MATCH
  • "99512210" <- NOT MATCH
  • "41234123" <- NOT MATCH

How can I use python to create the regex string pattern "expression" given any reasonable start_number and end_number? I have several start/end number 'blocks' I have to create regex patterns for, I just need a way to make these patterns programatically.

It's fair to assume that:

  • Start_number will always be less than end_number
  • Always will be a positive integer.
  • In my case, start_number and end_number will always be the same 'length' (i.e. always will have the same number of base 10 'characters' when represented as a string), if it'll make life easier.

Edit: for clarity

TC Fox
  • 980
  • 4
  • 13
  • 25
  • Why not matching the right number of digits (`\d{8}`), and then for all matches converted in int check if its in the range ? – Scharron Jul 24 '13 at 09:24
  • 1
    addint to @rnbcoder wrote cant you convert str to int and compare? `int(start_number) <= test_number <= (end_number)` – Srikar Appalaraju Jul 24 '13 at 09:24
  • 1
    Is there any reason this has to be done with a regex? I would just convert the number to an int and check if it is between start_number and end_number. This may be slightly less efficient, but the order of complexity is the same, since checking a regex match is O(n), and converting string to int is O(n) then the two int comparisons are O(1). – Vyassa Baratham Jul 24 '13 at 09:25
  • @SrikarAppal from his code, they are already integer. you can say, test_number may be str, and you have to change it to int. – rnbguy Jul 24 '13 at 09:35
  • If you want to improve your chances to get a good answer to your question you should really include some sample data. Input/Output etc. – Inbar Rose Jul 24 '13 at 09:36
  • Unfortunately regex is a requirement: the `expression` string is put into a config file and parsed by some hardware/firmware this way. I wish there was another way too! :( – TC Fox Jul 24 '13 at 09:52
  • I assume from the OP's description that the regexp is needed to validate input and that a numeric comparison is not available at the validation stage? If this is the case seems like a problem with the validation library really – Vorsprung Jul 24 '13 at 09:52
  • @Inbar Rose: Consider the inputs as `start_number` and `end_number` (both integers) and the output is `expression` (as a string). I know how to match these numbers, I just want a way of programatically generating the regex pattern. – TC Fox Jul 24 '13 at 09:55
  • @TrojanCentaur You didn't understand me. Please add to your answer an example of this "stream of phone numbers" a sample "start" and "end" values and what the output of that should be. This is the best way for you to help yoursef. – Inbar Rose Jul 24 '13 at 09:56
  • @InbarRose: Updated. Just note I need to programatically create the `expression` string specifically, the stream of phone numbers I will be using to match against these blocks is merely provided to add some background to the problem. :) – TC Fox Jul 24 '13 at 10:03
  • Okay, but you understand that it is an import aspect of the solution, are the numbers received one by one? Or is a stream of numbers which you need to divide into chunks of 8 and create phone numbers that do or don't fall within a range? Or something else? Every use-case, or setup has a different answer. The way you added the information makes it seem like you are simply receiving a list of numbers, in which case, regular expressions are unnecessary. – Inbar Rose Jul 24 '13 at 10:11
  • The hardware that uses this unfortunately interprets the numbers as strings, and requires the use of regex to match each number. The numbers arrive one by one and could arrive at any time, hence I can't just use a list. Understand what you mean though, duly noted. :) – TC Fox Jul 24 '13 at 10:35
  • 1
    So then simply whenever you receive a number from your stream, send it to a function that will convert it to an integer, and check if it is within a range - no need for regular expressions. What stream are you using? I would open a new question, as this is another classic example of an [XY Probelm](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – Inbar Rose Jul 24 '13 at 14:28
  • 1
    frustrating that people here are telling you to use a different solution just because they can't answer the question. what you ask for is possible and seems reasonable given the constraints you have. – andrew cooke Jul 25 '13 at 12:17

2 Answers2

21

[assuming you need this because it's some weird 3rd party system that requires regexp]

New Approach

the more i think about Frederik's comment, the more i agree. the regexp engine should be able to compile this down to a compact DFA, even if the input string is long. for many cases, the following is a sensible solution:

import re

def regexp(lo, hi):
    fmt = '%%0%dd' % len(str(hi))
    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

(it works fine with all the numerical ranges in the tests below, including 99519000 - 99519099. a rough back-of-the-envelope calculation suggests that 9 digit numbers are about the limit with 1GB of memory. that's if most numbers that size are matched; if only a few are matched you can go much larger.).

Old Approach

[updated again to give even shorter results - apart from coalescing the occasional \d\d it's about as good as hand-generated]

assuming all numbers are the same length (ie you zero pad on the left if necessary), this works:

import re

def alt(*args):
    '''format regexp alternatives'''
    if len(args) == 1: return args[0]
    else: return '(%s)' % '|'.join(args)

def replace(s, c): 
     '''replace all characters in a string with a different character'''
    return ''.join(map(lambda x: c, s))

def repeat(s, n):
    '''format a regexp repeat'''
    if n == 0: return ''
    elif n == 1: return s
    else: return '%s{%d}' % (s, n)

def digits(lo, hi): 
    '''format a regexp digit range'''
    if lo == 0 and hi == 9: return r'\d'
    elif lo == hi: return str(lo)
    else: return '[%d-%d]' % (lo, hi)

def trace(f):
    '''for debugging'''
    def wrapped(lo, hi):
        result = f(lo, hi)
        print(lo, hi, result)
        return result
    return wrapped

#@trace  # uncomment to get calls traced to stdout (explains recursion when bug hunting)
def regexp(lo, hi):
    '''generate a regexp that matches integers from lo to hi only.
       assumes that inputs are zero-padded to the length of hi (like phone numbers).
       you probably want to surround with ^ and $ before using.'''

    assert lo <= hi
    assert lo >= 0

    slo, shi = str(lo), str(hi)
    # zero-pad to same length
    while len(slo) < len(shi): slo = '0' + slo
    # first digits and length
    l, h, n = int(slo[0]), int(shi[0]), len(slo)

    if l == h:
        # extract common prefix
        common = ''
        while slo and slo[0] == shi[0]:
            common += slo[0]
            slo, shi = slo[1:], shi[1:]
        if slo: return common + regexp(int(slo), int(shi))
        else: return common

    else:
        # the core of the routine.
        # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199
        # and handle each separately.

        # are these complete blocks?
        xlo = slo[1:] == replace(slo[1:], '0')
        xhi = shi[1:] == replace(shi[1:], '9')

        # edges of possible complete blocks
        mlo = int(slo[0] + replace(slo[1:], '9'))
        mhi = int(shi[0] + replace(shi[1:], '0'))

        if xlo:
            if xhi:
                # complete block on both sides
                # this is where single digits are finally handled, too.
                return digits(l, h) + repeat('\d', n-1)
            else:
                # complete block to mhi, plus extra on hi side
                prefix = '' if l or h-1 else '0'
                return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))
        else:
            prefix = '' if l else '0'
            if xhi:
                # complete block on hi side plus extra on lo
                return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi))
            else:
                # neither side complete, so add extra on both sides
                # (and maybe a complete block in the middle, if room)
                if mlo + 1 == mhi:
                    return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))
                else:
                    return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))


# test a bunch of different ranges
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),
                 (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),
                 (0, 123), (111, 123), (123, 222), (123, 333), (123, 444),
                 (0, 321), (111, 321), (222, 321), (321, 333), (321, 444),
                 (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),
                 (99519000, 99519099)]:
    fmt = '%%0%dd' % len(str(hi))
    rx = regexp(lo, hi)
    print('%4s - %-4s  %s' % (fmt % lo, fmt % hi, rx))
    m = re.compile('^%s$' % rx)
    for i in range(0, 1+int(replace(str(hi), '9'))):
        if m.match(fmt % i):
            assert lo <= i <= hi, i
        else:
            assert i < lo or i > hi, i

the function regexp(lo, hi) builds a regexp that matches values between lo and hi (zero padded to the maximum length). you probably need to put a ^ before and a $ after (as in the test code) to force the match to be the whole string.

the algorithm is actually quite simple - it recursively divides things into common prefixes and "complete blocks". a complete block is something like 200-599 and can be matched reliably (in this case with [2-5]\d{2}).

so 123-599 is split into 123-199 and 200-599. the last half is a complete block, the first half has a common prefix of 1 and 23-99, which is recursively handled as 23-29 (common prefix) and 30-99 (complete block) (and we terminate eventually, because arguments to each call are shorter than the initial input).

the only nasty detail is the prefix, which is needed because the arguments to regexp() are integers, so when called to generate, say, the regexp for 00-09 it actually generates the regexp for 0-9, without the leading 0.

the output is a bunch of test cases, showing the range and regexp:

   0 - 0     0
   0 - 1     [0-1]
   0 - 2     [0-2]
   0 - 9     \d
  00 - 10    (0\d|10)
  00 - 11    (0\d|1[0-1])
 000 - 101   (0\d\d|10[0-1])
   1 - 1     1
   1 - 2     [1-2]
   1 - 9     [1-9]
  01 - 10    (0[1-9]|10)
  01 - 11    (0[1-9]|1[0-1])
 001 - 101   (0(0[1-9]|[1-9]\d)|10[0-1])
 000 - 123   (0\d\d|1([0-1]\d|2[0-3]))
 111 - 123   1(1[1-9]|2[0-3])
 123 - 222   (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
 123 - 333   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
 123 - 444   (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
 000 - 321   ([0-2]\d{2}|3([0-1]\d|2[0-1]))
 111 - 321   (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 222 - 321   (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
 321 - 333   3(2[1-9]|3[0-3])
 321 - 444   (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
 123 - 321   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 111 - 121   1(1[1-9]|2[0-1])
 121 - 222   (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321  (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
 000 - 999   \d\d{2}
99519000 - 99519099  995190\d\d

it takes a while to run as the last test loops over 99999999 numbers.

the expressions should be compact enough to avoid any buffer limits (i would guess that memory size worst case is proportional to the square of the number of digits in the largest number).

ps i am using python 3, but i don't think it makes much difference here.

andrew cooke
  • 45,717
  • 10
  • 93
  • 143
  • 1
    As a footnote, Python's regexp machinery does some of this automatically, try e.g. `re.sre_parse.parse("|".join(str(i) for i in range(99519000,99519100)))` and you'll see how it pulls out the shared prefix. I don't think the compiler is smart enough to map e.g. 0-9 to \d, though. – Fredrik Jul 24 '13 at 18:15
  • 1
    Good luck debugging that – mbatchkarov Jul 25 '13 at 12:24
  • is there something you don't understand? i've added comments to the code, but i'm happy to explain more if there's some point you're not getting (you don't need to debug the regexps - it's the code you need to worry about). – andrew cooke Jul 25 '13 at 13:03
  • 2
    This is a fantastically impressive response! Really appreciate the work that went into this, seriously! – TC Fox Jul 27 '13 at 09:47
  • I've run a few tests and memory-wise, the "new approach" is not nearly as efficient as the "old approach". This starts becoming apparent when there are huge differences in the ranges. Don't have a proof of this, but I suspect the "new approach" may have a O(N) component where N is the difference in ranges. Just notice how large the following is: `re.sre_parse.parse("|".join(str(i) for i in range(10000, 40000)))` – speedplane Sep 23 '19 at 05:14
  • @andrewcooke would you mind if I take your code, add some modifications and put it on github? I'll credit you if you provide your github handle and I'll release under an Apache license or similar. – speedplane Sep 26 '19 at 21:06
3

Use the python package regex_engine for generate regular expressions for numerical ranges

You can install this package by using pip

pip install regex-engine

from regex_engine import generator

generate = generator()

regex = generate.numerical_range(99519000, 99519099)

print(regex)

^(995190[1-8][0-9]|9951900[0-9]|9951909[0-9])$

You can also generate regexes for floating point and negative ranges

from regex_engine import generator

generate = generator()

regex1 = generate.numerical_range(5,89)
regex2 = generate.numerical_range(81.78,250.23)
regex3 = generate.numerical_range(-65,12)
ASHWIN RAJEEV
  • 2,525
  • 1
  • 18
  • 24