2

I have a problem when verifying a username input field.

The regex:

var mailformat = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,4})+$/;
var letterNumber = /^[0-9a-zA-Z]*$/;

The check :

if (!letterNumber.test($('#login_username').val()) 
      && !mailformat.test($('#login_username').val())){
 ...
}

The browser (chrome and ie) hangs when I want to verify with this entry:

012345678901234567890123456789@

Anyone any idea why?

I tried with the regex in one expression, but got the same result. (regex are ok because seperately they work)

user2992220
  • 1,092
  • 1
  • 12
  • 20

2 Answers2

5

You have stumbled upon catastrophic backtracking !

To fix it, you can use:

^\w+([.-]\w+)*@\w+([.-]\w+)*(\.\w{2,4})+$

Before declaring a failure to match, a regex engine has to check all possible ways of applying the regex.

With \w+([.-]?\w+)*, each digit could be matched both by \w+ or by ([.-]?\w+)*, and on failure ([.-]?\w+)* can behave the same as (\w+)*. This was creating a lot of states to come back to.

Making the [.-] mandatory removes this ambiguity.


On (catastrophic) backtracking

The way a regex engine works is that it moves through the regex, trying to match each part of it to the string. Each time it has a choice, it saves the state so that it can backtrack to it if this was the wrong path.

To illustrate, see how .*b matches abc.

  • The .* quantifier is greedy so it will match as much as possible: it starts by matching nothing, then adds a, then adds b, then adds c. Each time the engine remembers it chose to add another character. Then the engine tries to match b, but we're at the end of the string so that fails.

  • No biggie, we come back to the last choice we made: now .* only matches ab. The final b still doesn't match the string's ending c.

  • We come even further back: .* matches a, b can finally match. Overall match found: ab

Before knowing the regex cannot match the input, the engine has to try all these alternatives. If you allow a character to be matched by multiple parts of the regex, that number increases.

In your case it was increasing exponentially with the length of the string, to the point that it takes forever to return anything: that's catastrophic backtracking.

Great illustration by Friedl on the subject: http://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0006.html

Robin
  • 9,415
  • 3
  • 34
  • 45
  • Thanks for the answer. Indeed removing the '?' fundamentally doesn't change the regex. I'm checking yor answer and reading the references (2^30 steps is much indeed) – user2992220 Jun 03 '14 at 09:12
2

Because catastrophic backtracking is at play here. Most of the terms in your regex

`^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,4})+$`

almost match the full length of the test string

`012345678901234567890123456789@`

And then it has to backtrack because the complete match has not occurred.

enter image description here

eg. ^\w+ matches 012345678901234567890123456789@ and so does ^\w+([\.-]?\w+)* and eventually ^\w+([\.-]?\w+)*@ matches the full string.
Now there are words following @ as well viz. ^\w+([\.-]?\w+)*@\w+ in the regex and thus it has to backtrack and find a suitable match for the preceding part ie. ^\w+([\.-]?\w+)*@ once again and this process of backtracking continues until it finds a match.

Dhrubajyoti Gogoi
  • 1,265
  • 10
  • 18
  • I checking your answer, and indeed the mailformat regex also doesn't function on itself. – user2992220 Jun 03 '14 at 08:48
  • There are multiple article's on the topic eg. http://www.regular-expressions.info/catastrophic.html and https://en.wikipedia.org/wiki/ReDoS#Evil_regexes. I hope that helps. – Dhrubajyoti Gogoi Jun 03 '14 at 08:53
  • @user2992220 For email validation you may refer to a previous [SO question](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address). It has quite a few good options. – Dhrubajyoti Gogoi Jun 03 '14 at 09:11
  • Thanks for the correct response. Will appy solution suggested by Robin. – user2992220 Jun 03 '14 at 09:19