How to check a list of regex faster

Question

I'm using regex in C where I'm checking a word against a list of regex, below is what I have,

I have this string

0118889994444

and I have these regexes

^012[0-9]{10}$ if this one hits then do 1
^0011[0-9]{10}$ if this one hits then do 2
^00[0-9]{10}$ if if this one hits then do 3  
^11[0-9]{10}$ if this one hits then do 4
^011[0-9]{10}$ if this one hits then do 5 // this one will match the string

What I'm currently doing is looping through the regex list and see which one will hit and then do whatever is set for that regex, so, the bigger the list the more time it takes to finish the loop, is there a way or a trick to make this faster and more intelligent :) ?

i assume you do have to check against all regex, unless you have a magical trick that can tell you beforehand which regex is going to match ? — njzk2, Dec 11 '13 at 21:02
apparently your regexes all define a fixed length, so may be you can filter from that. — njzk2, Dec 11 '13 at 21:03
you could also split things and test each character individually (like, if it starts with 1, test only the regex 4, else if second character is 0, test only 2 and 3...) — njzk2, Dec 11 '13 at 21:04
Actually the first question I had in mind was "Is there a way to find the closest regexes that might hit" which could be the solution for that but I'm not sure yet how to do it — diameter de, Dec 11 '13 at 21:06
you can make up rediculous optimizations making this faster and faster, but you will probably end up with something nobody can understand or maintain. Is the speed of this tiny thing really an issue? — mvds, Dec 11 '13 at 21:06
the above is an example so the length won't be always the same, but yeah that's right I'm thinking of having multiple stages of regex — diameter de, Dec 11 '13 at 21:07
@mvds, I agree, I'm just worried if I end up with huge list of regex. — diameter de, Dec 11 '13 at 21:08
What about checking the beginning of the string for unique bits (01, 00, 11) and then splitting the regex accordingly? It wouldn't save you creating the same number of expressions, but the code would omit looping through the ones that don't start with the right match. That said, you're using C, so that would require: http://stackoverflow.com/questions/4770985/something-like-startswithstr-a-str-b-in-c :| — brandonscript, Dec 11 '13 at 21:12
@diameterde you do know that asking such a question eventually involves you benchmarking all proposed solutions, right? ;-) — mvds, Dec 11 '13 at 21:32
if you want to optimize it, may be you should take out the things that repeat, first checking [0-9]{10}, and then a decision tree with if(^0) if(^00) .. if(^01).. etc — fernando.reyes, Dec 11 '13 at 21:35

score 1 · Answer 1 · answered Dec 11 '13 at 21:15

In the case above I would drop regex altogether, and go for a straightforward approach of checking the prefix against a fixed list, followed by the detection that the rest of the string is composed of ten digit. You can do it like this:

struct mathc_def {
    const char *prefix;
    int offset;
} match_defs[] = {
    {.prefix = "012",  .offset = 3}
,   {.prefix = "0011", .offset = 4}
,   {.prefix = "00",   .offset = 2}
,   {.prefix = "11",   .offset = 2}
,   {.prefix = "011",  .offset = 3}
};

bool ten_digits(const char* str) {
    int i = 0;
    while (isdigit(str[i])) {
        i++;
    }
    return i == 10;
}

char *str = "0118889994444";
for (int i = 0 ; i != 5 ; i++) {
    if (strstr(str, match_defs[i].prefix) == str && ten_digits(&str[match_defs[i].offset])) {
        printf("Item %d matched.\n", i);
    }
}

+1, and instead of a linear list you could consider a simple tree, particularly when the list is much longer. (What sort of structure syntax is this, BTW?) — Jongware, Dec 11 '13 at 21:22
@Jongware That's [*designated initializers*](http://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html), they were introduced in C99. — Sergey Kalinichenko, Dec 11 '13 at 21:24

mvds · Answer 2 · 2013-12-11T21:29:42.163

Not sure if it will be the fastest approach, but you could convert the (decimal) string into a 64 bit long and check the value divided by 1e10. Apart from that, check the length of the string to verify the leading zeroes.

This will of course only work as long as the characters are in the [0-9] range.

Example, using uintmax_t as integer type:

    uintmax_t val,prefix;
    char *endp=NULL;
    int len, prefixlen;

    printf("sizeof val: %lu\n",sizeof(val));

    val = strtoumax(argv[1],&endp,10);
    prefix = val / 1e10;

    len = endp - argv[1];
    prefixlen = len - 10;

    printf("val=%ju, len=%u\n",val,len);
    printf("prefix=%ju, len=%u\n",prefix,prefixlen);

    switch ( prefixlen )
    {
            case 2:
                    if ( prefix == 0 ) ; // do 3
                    if ( prefix == 11 ) ; // do 4
                    break;
            case 3:
                    if ( prefix == 12 ) ; // do 1
                    if ( prefix == 11 ) ; // do 5
                    break;
            case 4:
                    if ( prefix == 11 ) ; // do 2
                    break;
    }

This is probably fast compared to any solution comparing actual strings (just one loop to parse the number), but hard to maintain or expand to non-decimal strings or values exceeding 2^64.

How to check a list of regex faster

2 Answers2