
I'm debugging a problem in an application that uses a regular expression to validate emails client-side (yeah I know, both are kinda stupid), and the problem has me really stumped.

The thing is that the validation works just fine in Chrome but fails in Firefox and I wonder if it is a bug or if there's something wrong with the regular expression that causes the error.

Please check this fiddle for a complete test case: http://jsfiddle.net/KQvgJ/

new RegExp(/^\S+([\_\-\.]*\S+[\_\-]?)*@\S+([\_\-]?\S+)*\.+([\-\_]?\S)+(\.?\S+)*$/);

In Firefox only, the regex above matches mw@thisissometest.de but not mw@thisissometestbutlong.de.

It seems to fail based on the length of the input alone, but there is no length restriction in the expression at all!?

Andreas Gohr
  • There is a lot of useless code in that regex. Just use `^\S+@\S+$`. You may also want to read [this](http://davidcel.is/blog/2012/09/06/stop-validating-email-addresses-with-regex/). – HamZa Jul 18 '13 at 10:21
  • @HamZa: That will validate too much, even invalid stuff such as `@@@`. – Mario Jul 18 '13 at 10:23
  • I already stated that validating emails by regexps is stupid. That's not what I'm asking. I want to know if there is something in that special regexp that makes two different JS engines calculate it differently and what it is. – Andreas Gohr Jul 18 '13 at 10:24
  • @Mario well look at his expression, it will also match that kind of stuff. Conclusion: don't validate email addresses, you will most likely fail. If you do, just check if there is `@` :) – HamZa Jul 18 '13 at 10:24
  • Is there any specific reason you try to escape underscores? Maybe that's confusing it (would still consider it a bug). – Mario Jul 18 '13 at 10:25
  • See also [Using a regular expression to validate an email address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address) – Spudley Jul 18 '13 at 10:59
  • Read this: [Runaway Regular Expressions: Catastrophic Backtracking](http://www.regular-expressions.info/catastrophic.html) - You create something similar here, even though you've got the `@` in there that saves you a little (and you only use two letters before it). – hakre Jul 18 '13 at 11:40
  • @hakre that seems to be a good explanation, thank you. Can you make your comment into a proper answer for upvoting? – Andreas Gohr Jul 18 '13 at 11:58
  • @AndreasGohr: No, because this answer already exists on this website. Please, if you run into a problem with something specific in programming, find the specs about it. Also you should create an issue with Firefox here because even though this smells like catastrophic backtracking, it's not clear why this doesn't throw an exception but returns as if it wouldn't match. So this smells like a flaw here. Just saying, please write a nice issue report in the Firefox bugtracker, provide your examples and listen to feedback requests. – hakre Jul 18 '13 at 12:01

2 Answers


Improving the concept

First, let's make clear that \S+ will match anything that's not a white space one or more times.

^\S+([\_\-\.]*\S+[\_\-]?)*@\S+([\_\-]?\S+)*\.+([\-\_]?\S)+(\.?\S+)*$
    ^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^ ^^ ^^^^^^^^^^^^^^^^^^^^
    This all gets matched      Same here    wut?     Same here, just
     with \S+, so we can                   repeat         use \S+
       drop it                              dots?

So we can make it simpler by just using \S+@\S+\.\S+. But wait: \S also matches @ and ., so that pattern still accepts nonsense such as @@@.@. Let's use ^[^\s@]+@[^\s@]+$ instead.

  • ^ : start of the string
  • [^\s@]+ : match one or more characters that are neither whitespace nor @
  • @ : match @
  • [^\s@]+ : match one or more characters that are neither whitespace nor @
  • $ : end of the string
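As a quick check of the breakdown above, here is a minimal sketch that runs the simplified pattern against a few inputs. The two addresses come from the question; the other sample strings and the name `simpleEmail` are made up for illustration.

```javascript
// The simplified pattern: exactly one "@", no whitespace anywhere.
const simpleEmail = /^[^\s@]+@[^\s@]+$/;

// The first two inputs are from the question; the rest are illustrative.
const inputs = [
  "mw@thisissometest.de",
  "mw@thisissometestbutlong.de",
  "@@@",
  "no-at-sign",
  "two words@example.com",
];

for (const input of inputs) {
  console.log(JSON.stringify(input), simpleEmail.test(input));
}
```

Both addresses from the question are accepted, while `@@@`, a string without `@`, and a string containing a space are rejected.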

Fixing your regex

Let's fix your regex. Note that in a character class you don't need to escape dots, underscores etc. Put the hyphen at the beginning and you don't need to escape it either. After this, let's remove that ugly quantifier in \.+. The result should look like: ^\S+([-_.]*\S+[-_]?)*@\S+([-_]?\S+)*\.([-_]?\S)+(\.?\S+)*$
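To illustrate the point about escaping inside a character class, the following sketch compares the escaped and unescaped spellings on a handful of characters. The sample list is arbitrary, and the patterns are written without the `u` flag, where escapes like `\_` are tolerated.

```javascript
// Two spellings of the same character class (no `u` flag, so \_ is allowed).
const escaped = /^[\_\-\.]$/;
const plain = /^[-_.]$/; // hyphen first, so it is read literally

// Arbitrary sample characters: the three delimiters plus some others.
const samples = ["-", "_", ".", "a", "@", "\\", " "];

for (const ch of samples) {
  console.log(JSON.stringify(ch), escaped.test(ch), plain.test(ch));
}
```

Each sample gets the same answer from both patterns, so the backslashes only add noise.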

Now by eliminating some parts of the regex, I have found the culprit. It's the \S+:

^\S+([-_.]*\S+[-_]?)*@\S+([-_]?\S+)*\.([-_]?\S)+(\.?\S+)*$
     here --^

So your final regex should be ^\S+[-_.]*@\S+([-_]?\S+)*\.([-_]?\S)+(\.?\S+)*$.
See it working!
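A minimal sketch to check the final regex against the two addresses from the question (the name `fixed` is just for this example):

```javascript
// The final regex from this answer, written as a literal.
const fixed = /^\S+[-_.]*@\S+([-_]?\S+)*\.([-_]?\S)+(\.?\S+)*$/;

// Both addresses come from the question.
for (const address of ["mw@thisissometest.de", "mw@thisissometestbutlong.de"]) {
  console.log(address, fixed.test(address));
}
```

This only exercises the matching case. The part after the @ still nests `\S+` inside a repeated group, so a long input that cannot match (for example one with no dot after the @) may still be slow.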

Now if you ask me why, I honestly don't know, but as always I recommend reading the following article: Stop Validating Email Addresses With Complicated Regular Expressions.

HamZa

The fault is definitely in your regex: it's pathologically inefficient. Basically, you've got multiple consecutive parts that can match the same characters, all controlled by open-ended quantifiers (* and +). This creates an astronomical number of "paths" the regex has to check before it gives up on the match. In fact, this sort of problem usually becomes apparent only when no match is possible, but you've managed to trigger it on a regex that should match.
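To get a feel for how fast the number of "paths" grows, here is a rough counting sketch. It is a simplified model, not what any particular engine does internally: it only counts the ways a run of n non-space characters can be cut into consecutive non-empty chunks, which is how many ways a group like `(\S+)*` alone could consume that run. The helper name `countSplits` is made up for this example.

```javascript
// Count the ways n characters can be divided into consecutive non-empty
// chunks, i.e. the ways (\S+)* by itself could consume an n-character run.
function countSplits(n) {
  const ways = new Array(n + 1).fill(0);
  ways[0] = 1; // nothing left to split
  for (let len = 1; len <= n; len++) {
    for (let first = 1; first <= len; first++) {
      ways[len] += ways[len - first]; // first chunk takes `first` characters
    }
  }
  return ways[n];
}

// The parts after the "@" in the two addresses from the question.
const shortDomain = "thisissometest.de";
const longDomain = "thisissometestbutlong.de";

console.log(shortDomain.length, countSplits(shortDomain.length)); // 17 65536
console.log(longDomain.length, countSplits(longDomain.length));   // 24 8388608
```

The count doubles with every extra character, so seven more characters mean 128 times as many ways to split. That is consistent with the behaviour depending on input length even though the regex contains no length restriction.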

I suspect you were trying for something like this:

/^[a-z]+(?:[_.-][a-z]+)*@[a-z]+(?:\.[a-z]+)*$/i

Before anyone starts criticizing, I know [a-z]+ is no more correct than \S+. I'm just trying to explain what's wrong with his regex. The idea is to force the user name and the domain name to start with letters, while allowing them to be separated into chunks around delimiters like ., -, and _. That's what makes it so complicated.

The most important feature of this regex is that it always moves forward. When [a-z]+ runs out of letters to consume, the next thing it sees must be one of the delimiter characters, an at-sign ('@'), or the end of the string (depending on which part of the address it's matching). If it doesn't see what it expects, the match attempt fails immediately.
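A small sketch of that behaviour, using the regex from this answer. The addresses from the question are included; the other inputs, and the name `chunked`, are invented for illustration.

```javascript
// Letters in chunks, separated by single delimiters, around one "@".
const chunked = /^[a-z]+(?:[_.-][a-z]+)*@[a-z]+(?:\.[a-z]+)*$/i;

const cases = [
  "mw@thisissometest.de",
  "mw@thisissometestbutlong.de",
  "first.last@example.co.uk",
  ".mw@example.de",  // starts with a delimiter
  "mw@@example.de",  // two at-signs
];

for (const input of cases) {
  console.log(input, chunked.test(input));
}

// A long input that cannot match still fails quickly, because no two
// adjacent parts of the regex compete for the same characters.
const hopeless = "a".repeat(10000) + "!";
console.log(chunked.test(hopeless));
```

The first three inputs match and the last two do not. The 10,000-character input is rejected without any noticeable delay.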

In your regex the \S+ part initially gobbles up the whole string, then starts backing off one character at a time to give the next part a chance to match. This process is repeated for every \S+. As HamZa observed, that's where the regex engine spends most of its time. But it's not the \S+ alone that's killing you, it's the structure of the regex.
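The gobble-then-back-off behaviour can be observed on a deliberately small pattern. This sketch is not the regex from the question; it is a two-group toy with made-up input, chosen so the capture groups show where the first \S+ ended up.

```javascript
// \S+ is greedy: it first takes the whole string, then gives characters
// back one at a time until the rest of the pattern can match.
const match = "a@b@c".match(/^(\S+)@(\S+)$/);

// The first group stops at the LAST "@", the first position (counting
// back from the end) where the remainder "@c" fits the rest of the pattern.
console.log(match[1]); // "a@b"
console.log(match[2]); // "c"
```

With one \S+ this backing off is cheap. It becomes a problem when several \S+ parts sit next to each other or inside a repeated group, because each of them backs off independently.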

Alan Moore