
I want a regular expression (as efficient as possible, because I use C++ and its regex engine isn't very efficient) that matches any string containing a % that is not immediately followed by:

1) a letter [a-zA-Z]

or

2) .NUMBERS[a-zA-Z]

or

3) NUMBERS[a-zA-Z]

So I want to match strings like these: "dsfdf (%) dsfgs %d s32523", "%d %d % %t dsg"

And I don't want to match strings like these: "sfsf %d", "dfsd %.464d, %353T"

darkThoughts

1 Answer


Use a negative look-ahead expression: `%(?!([.]?[0-9]+)?[a-zA-Z])`

Negative lookahead is indispensable if you want to match something not followed by something else: `q(?!u)` means q not followed by u.

In your case q is `%`, and u is `([.]?[0-9]+)?[a-zA-Z]` (an optional prefix consisting of an optional dot followed by one or more digits, then a letter).


Note: This expression uses + in the look-ahead section, a feature that does not have universal support. If your regex engine does not take it, set an artificial limit of, say, 20 digits by replacing [0-9]+ with [0-9]{1,20}.

Edit:

What about writing my own parser?

If you need the ultimate speed for this relatively simple regex, use a hand-written parser. Here is a quick example:

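#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main() {
    // Read one string per line and report whether it contains a "bad" %.
    for (string str ; getline(cin, str) ; ) {
        bool found = false;
        size_t pos = 0;
        while (!found && (pos = str.find('%', pos)) != string::npos) {
            // A % at the very end of the string is followed by nothing.
            if (++pos == str.size()) {
                found = true;
                break;
            }
            // An optional dot must be followed by at least one digit.
            if (str[pos] == '.') {
                if (++pos == str.size()) {
                    found = true;
                    break;
                }
                if (!isdigit(static_cast<unsigned char>(str[pos]))) {
                    found = true;
                    break;
                }
            }
            // Skip the digits; running out of input here means no letter follows.
            while (isdigit(static_cast<unsigned char>(str[pos]))) {
                if (++pos == str.size()) {
                    found = true;
                    break;
                }
            }
            // Whatever follows the optional number must be a letter.
            // If pos == str.size(), str[pos] is '\0' (since C++11), which is not a letter.
            found |= !isalpha(static_cast<unsigned char>(str[pos]));
        }
        cout << '"' << str << '"' << " : " << found << endl;
    }
}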


Sergey Kalinichenko
  • Thanks, it works correctly. The problem is that the C++ regex engine is still too slow. When I use the C++/CLI (aka .NET) regex the program finishes in less than 6 seconds, and when I use this regex in ANSI C++ the program finishes in 43 seconds. The code I use in ANSI C++ is `std::regex reg = std::regex("^.*%(?!([.]?[0-9]+)?[a-zA-Z]).*$"); bool answer = std::regex_search(str, reg)`. Is there a way to optimize it and speed it up? – darkThoughts Jul 11 '17 at 12:06
  • By the way, it's still much faster than when I used `std::regex_match` with the regex `"^.*%(?!([.]?[0-9]+)?[a-zA-Z]).*$"`. That took over 3 minutes. – darkThoughts Jul 11 '17 at 12:07
  • @darkThoughts Try removing the `^` and `$` anchors, and match `%(?!([.]?[0-9]+)?[a-zA-Z])` one string at a time. – Sergey Kalinichenko Jul 11 '17 at 12:23
  • That helped a bit, thanks. But it's still much slower than the .NET regex. Is there another implementation I can use to maybe speed things up? – darkThoughts Jul 11 '17 at 12:44
  • @darkThoughts You may want to try a limited `%(?!([.]?[0-9]{1,20})?[a-zA-Z])` expression, or give `boost::regex` a try, but trying to match .NET speed is not an easy task, [because its engine has been very well tuned over many years](https://stackoverflow.com/q/19798653/335858). – Sergey Kalinichenko Jul 11 '17 at 12:57
  • Thanks, I will try Boost. What about writing my own parser? I was given this suggestion, and was told it may be much, much faster than regex. – darkThoughts Jul 11 '17 at 13:04