0

I have a regex as follows /^(\s*\(*\s*[0-9]*\s*[+\-*/]?\s*[0-9]*\s*\)*\s*)*$/ it's not perfect, but it's designed as a fast check that the string being input is a basic math formula, e.g., 7 * 9 * (6 + 5). I do secondary checks if this passes to look for unclosed parenthesis (the formula is also allowed to end on an operator).

Javascript's String.prototype.match is able to use the regex to very quickly match against some strings, e.g., "7 * 912 + 6 + 7 +".match(regex) - but is very slow against others:

  1. "7 * 912 + 6 + $ +".match(regex)
  2. "7 * 912 + 6 + ^ +".match(regex)
  3. "7 * 912 + 6 + [ +".match(regex)
  4. "7 * 912 + 6 + ] +".match(regex)

It is however fast for: "7 * 912 + $ +".match(regex)

Presumably the reason for this is that the string contains special chars, combined with the ? in the middle (the problem goes away if I remove that) - but only when a certain number of operators are combined? Is there a way to improve the performance of this? Right now I'm just pre-checking for any special chars (since none are allowed anyway) but I'd like something cleaner.

Zack Newsham
  • 2,810
  • 1
  • 23
  • 43
  • See my answer at https://stackoverflow.com/a/59580435/2191572 for validating basic math input. It does not handle parenthesis though but you can easily add a basic optional parenthesis check and optional whitespace. – MonkeyZeus Jan 08 '20 at 16:14
  • @MonkeyZeus wouldn't your answer have the same problem? The combination of (...+)* seems to be the issue if I understand correctly. – Zack Newsham Jan 08 '20 at 16:44
  • *it's not perfect* - it is awful. Never use optional patterns with `*` quantifier that may match at the same location in a string. – Wiktor Stribiżew Jan 08 '20 at 16:46
  • 1
    Regarding catastrophic backtracking: https://stackoverflow.com/questions/29751230/regex-pattern-catastrophic-backtracking –  Jan 08 '20 at 16:47
  • @WiktorStribiżew - was referring to the correctness, not the performance - I know the performance is terrible :( – Zack Newsham Jan 08 '20 at 16:49
  • Correctness has nothing to do here, a regex is correct or not only as per its specs, you provided none, just some examples. Here, the regex is written in a very poor way, not following best practices, hence, you get CA. Whether or not the string contains special chars or not is not important here. – Wiktor Stribiżew Jan 08 '20 at 16:52
  • @WiktorStribiżew - I see your point, I'd been testing it with matching strings and found that no matter the length, it was performant, but I now see that regardless of the character, if it does not match, it is slow. – Zack Newsham Jan 08 '20 at 16:56
  • That is what catastrophic backtracking does, it only shows itself with non-matching strings. The only way to solve the problem is **re-write the pattern** or **change the approach**, but to do that, one should understand the exact specs. – Wiktor Stribiżew Jan 08 '20 at 16:57
  • @WiktorStribiżew, I'd be interested to see a better pattern for this - the nature of the task seems to require the combination of quantifiers that causes the problem, matching any length of numbers (plus period), followed by an operator, then any length of numbers, and all of that repeated any number of times - and thats without considering whitespace (which I could remove in advance) or parenthesis. – Zack Newsham Jan 08 '20 at 17:00
  • Probably, [this regex](https://regex101.com/r/Vz20LS/1) will do what you need. – Wiktor Stribiżew Jan 08 '20 at 17:12
  • @WiktorStribiżew - thanks very much, but it only supports an odd number of operators :( `7 * 912 + 6 + 5 + 5` doesnt work - it seems the big change is the addition of `?:` to the groups, do you know why making a group non-capturing would solve the performance issue? – Zack Newsham Jan 08 '20 at 17:32
  • See, it happens because your requirements are "I want to match strings like that" instead of "1. Strings starting with... 2. .... 3...". Try `^\s*\d+(?:\.\d+)?(?:\s*[+*\/-]\s*\d+(?:\.\d+)?)*\s*$`, see [demo](https://regex101.com/r/y05dpv/1) – Wiktor Stribiżew Jan 08 '20 at 18:11
  • @WiktorStribiżew thanks, though I did say "basic math formula" - nothing about a specific number of instructions. Anyway, thanks for your yelp – Zack Newsham Jan 08 '20 at 18:15
  • No, my regex does not have catastrophic backtracking like yours does. You can easily modify my regex to fit your needs but if you want a proper open/close parenthesis check then that does have to be handled separately. Your issue is `(...*)*` – MonkeyZeus Jan 08 '20 at 18:16

1 Answers1

1

Detecting a well-formed expression is pretty complicated - I would not attempt the whole thing using a single regex - I'd expand your multiple pass approach to the level required to meet your needs. I would definitely start with a basic character filter:

var charRegex = /[^0-9+\-*\(\)\s]/i;
function testString(str) {
    if (!charRegex.test(str)) {
        // fail
        return false;
    }
    // do further tests...
}

Your further testing could include regex's to test for multiple operators in a row, unclosed brackets, etc. And if you do this inside a function, you can early return at the first failed test.

To really do this fully, I think you need some sort of recursive function, not a series of tests on the whole string. But if you don't want to go that far, I think a series of simpler regex's may be good enough for your needs.

arbuthnott
  • 3,819
  • 2
  • 8
  • 21
  • 1
    I have to evaluate the entire formula anyway, so maybe I just won't bother with the regex, I have a recursive function that evaluates the formula and fails out on an NaN (or when an operator doesn't have both operands). I was just expecting the regex to be a faster way of doing a high pass check – Zack Newsham Jan 08 '20 at 16:47
  • In the recursion, each expression would either need to be just a number or have the form [expression] operator [expression]. I think regex might be usable during these steps of the recursion. – arbuthnott Jan 08 '20 at 17:54