3

Recently, I started using the email validation regular expression from the JQuery validation plugin in my Rails models.

EMAIL_REGEXP=/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i

"aaaaa.bbbbbb.ccccc.ddddd@gmail.com".match EMAIL_REGEXP  # returns immidiately
"aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

The regular expression takes a long time when an invalid email has many dot separated tokens (E.g: first.middle.last@gmail). Same expression works without any noticable delay in JavaScript.

Why is there such a difference in performance between Ruby and JavaScript regular expression parsers? Is there anything I can do to improve the response time?

I am on Ruby 1.8.7. I don't see the same issue on Ruby 1.9.2.

Note

I know the reg-exp is long. Since it is used by jQuery, I thought of using it. I can always change it back to a simpler regexp as shown here. My question is mostly about finding out the reason why the same regular expression is much faster in JS.

Reference:

JQuery Validation Plugin Source

Sample form with jQuery email validation

Harish Shetty
  • 64,083
  • 21
  • 152
  • 198
  • 2
    That is one huge regex. Why don't you use an email parser? – Blender Nov 29 '11 at 04:44
  • @Blender comment quality would improve immensely if you supplied a link – tekknolagi Nov 29 '11 at 04:45
  • 2
    Well this regular expression is taken from the official `jQuery` plugin and it is used widely. So I thought it is a safe route. I might go down the route of custom validation if I can't get the RegExp to work. – Harish Shetty Nov 29 '11 at 04:46
  • Another good question is, what are you trying to accomplish by matching the email address? You can't prove if it is valid; because even if a pattern matches, the address might not be live. The accepted way to test if an email is valid is to send a message to that address asking for a reply, and see if you get the response. See http://stackoverflow.com/q/1910340/128421 – the Tin Man Nov 29 '11 at 05:19
  • which ruby version? 1.9 handles regex very different from 1.8 – pguardiario Nov 29 '11 at 05:22
  • Also see http://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses – the Tin Man Nov 29 '11 at 05:25
  • You could try something like this: http://www.linuxjournal.com/article/9585 – Blender Nov 29 '11 at 14:56

2 Answers2

1

The problem might be in that your Regexp contains a greedy quantifiers, so Ruby as those quantifiers require trying to check all combinations. The solution might be to use Possessive Quantifiers, so the look-up whould be mush faster but it will change regexp so some strings will no longer match. Short Example (from wikipedia):

'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a+a+)/ => match
'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a++a+)/ => not match

The differense is in lookup process and in greedy quantifiers engine trying to look-back if no match, in the case of possessive quantifiers engine never looks back.

Mark Huk
  • 2,379
  • 21
  • 28
  • I was expecting the regexp to be slower due to greedy quantifiers. What surprises me is the difference in performance between the JS and Ruby execution times for the same reg-exp. – Harish Shetty Nov 29 '11 at 05:49
  • 1
    I have tryed your exmaple at [tryruby](http://tryruby.org/) and there was not big delays in code execution. By the way, second string dosnt match. – Mark Huk Nov 29 '11 at 05:57
  • Tryruby uses 1.9.2 that has new regex engine and handles this regexp very good. On the other hand 1.8.7 simply dies on the second example from the question. – KL-7 Nov 29 '11 at 06:44
  • Tryruby uses ruby 1.9.2. I see the issue only on 1.8.7.Second string is not supposed to match. – Harish Shetty Nov 29 '11 at 07:45
1

Don't know why regex parser from 1.8.7 is so much slower than the one from JS or Oniguruma from 1.9.2, but may be this particular regex can benefit from wrapping its prefix including @ symbol with atomic group like that:

EMAIL_REGEXP = /
  ^
  (?>(( # atomic group start
    ([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
    (\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
   )
   |
   (
     (\x22)
     (
       (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
       (
         ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
         |
         (\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
       )
     )*
     (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
     (\x22)
   )
  )
  @) # atomic group end
  (
    (
      ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      |
      (
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
        ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      )
    )
    \.
  )+
  (
    ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    |
    (
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    )
  )
  $
  /xi

puts "aaaaa.bbbbbb.ccccc.ddddd@gmail.com".match EMAIL_REGEXP  # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

Atomic group in that case should prevent parser from returning to the first part of the string when matching the part following @ symbol failed. And it gives significant speed-up. Though, I'm not 100% sure that it doesn't break regexp logic, so I'd appreciate any comments.

Another thing is using non-capturing groups that should be faster in general when you don't need to backreference for groups, but they don't give any noticeable improvement in this case.

KL-7
  • 46,000
  • 9
  • 87
  • 74