4

To my knowledge [ab] and (a|b) should be equivalent in purpose when trying to match against a set of characters. Now, look at two regex:

/^(\s|\u00A0)+|(\s|\u00A0)+$/g
/^[\s\u00A0]+|[\s\u00A0]+$/g

They should both match against whitespaces at the beginning and end of a string (See section on Polyfill here for more info on the regex itself). When using square brackets all things work well, but when you switch to parenthesis even the simplest of strings causes the browser to run seemingly indefinitely. This happens on latest Chrome and Firefox.

This jsfiddle demonstrates this:

a ="a                                                               b";

// Doesn't work
// alert(a.replace(/^(\s|\u00A0)+|(\s|\u00A0)+$/g,''));
// Works
alert(a.replace(/^[\s\u00A0]+|[\s\u00A0]+$/g,''));

Is this a crazy quirk with the browser's implementation of the regex engine or is there something else about the regex's algorithm which causes this?

nhahtdh
  • 55,989
  • 15
  • 126
  • 162
Parham
  • 3,157
  • 4
  • 31
  • 44
  • 3
    This question has been asked a lot. A quick search turned up this: http://stackoverflow.com/questions/22132450/why-is-a-character-class-faster-than-alternation – FLXN Jun 10 '15 at 15:24
  • 2
    @FLXN but the non-class regexp shouldn't freeze Chrome/Firefox for several minutes, right? This is more than just "using | is slower than character groups". – Ahmed Fasih Jun 10 '15 at 15:30
  • 1
    Check out https://regex101.com/r/jI0oA2/1 vs https://regex101.com/r/aW7kA7/1 especially the debugger which shows you why the bad one takes so long. On average, each extra space in the test string adds ~20 steps in the regexp engine for the bad case. For the good case, each space adds ~5 steps. – Ahmed Fasih Jun 10 '15 at 15:52
  • 1
    the ( and ) create addressable groupings, [ and ] do not, so (...)+ creates many groups, while [..]+ creates one. even if no matches are found, the first expression is a lot more work to figure that out. – Les Jun 10 '15 at 16:15
  • @FLXN: As Ahmed Fasih pointed out, I am wondering why this crashed the browsers and if this was a bug or something with how the regexs are expanded. That SO question does shed light on this matter together with Ahmed's links to regex101.com – Parham Jun 10 '15 at 16:51
  • As an additional comment, compare the number of steps from the regex links. The slower one takes about 4 times more steps. This should not crash the browser. I will edit my question to reflect this. – Parham Jun 10 '15 at 17:45
  • Sounds like simple catastrophic matching. – Etheryte Jun 10 '15 at 18:08
  • Hmm... it really seems to be a browser related problem.On Chrome and Firefox the browser hangs, meanwhile on IE and Safari I get an immediate result. – FLXN Jun 10 '15 at 20:10
  • Seems like V8 and SpiderMonkey don't like this expression but I must admit that I do not have enough deeper knowledge of the subject to further identify the cause of the problem. – FLXN Jun 10 '15 at 20:17
  • @AhmedFasih: Your test on regex101 is flawed. Since you didn't specify `u` flag in the PHP regex, `\s` doesn't match `U+00A0`, so the increase in number of steps is sub-exponential. With `u` flag on, you will see the number of steps increase exponentially. This is what happens with JavaScript regex, since `\s` in JavaScript matches `U+00A0` – nhahtdh Jun 11 '15 at 03:50
  • It took Chrome about five minutes to compute the slow regexp. @nhahtdh thanks for pointing out the error, I wanted to port the regexp to PCRE to use regex101's debugger (not available for JS regexps) and went with `\X00A0` as a hopeful equivalent to `\u00A0`. Please feel free to fix the PHP/PCRE regexp and post it! – Ahmed Fasih Jun 11 '15 at 13:24

1 Answers1

8

The problem you are seeing is called catastrophic backtracking, as explained here.

First of all, let me simplify and clarify your test case:

a = Array(30).join("\u00a0") + "b";  // A string with 30 consecutive \u00a0
s = Date.now();
t = a.replace(/^(\s|\u00A0)+$/g, '');
console.log(Date.now()-s, a.length);

What's happening is with the second part of the expression: ^(\s|\u00A0)+$. Note that \s matches a number of whitespace characters, including \u00A0 itself. This means both \s and \u00A0 matches each of the 30 \u00A0 characters.

Therefore if you try to match the string with /(\s|\u00A0)+/, you will find that each of the 2^30 different combinations of 30-character whitespace patterns will result in a match. When the regular expression matcher matched the first 30 characters it will try to match end of string ($) and failed, so it backtracks and ends up trying all 2^30 combinations.

Your original string (in jsfiddle, where the one in stackflow is already "normalized" to all spaces) is a \u00a0 \u00a0 ... \u00a0 b with roughly 30 \u00a0 characters, so it took the browser roughly 2^30 effort to complete. It does not hang the browser, but will take a few minutes to complete.

Alan Tam
  • 2,027
  • 1
  • 20
  • 33