After looking for a good email validation routine, I found this answer to a similar question and decided it was the most likely candidate. I implemented the following class for email validation (the RegexMatch class it inherits from validates a string against the regular expression supplied in the 'needle' key of an associative configuration array):

class Email extends RegexMatch implements iface\Prop
{
    const
        /**
         * Regular expression for validating email addresses
         * 
         * This regex is meant to validate against RFC 5322 and was taken from
         * a post on Stack Overflow regarding email validation (see the links)
         * 
         * @link http://www.ietf.org/rfc/rfc5322.txt, https://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1917982#1917982
         */
         PATTERN    = '
/(?(DEFINE)
   (?<address>         (?&mailbox) | (?&group))
   (?<mailbox>         (?&name_addr) | (?&addr_spec))
   (?<name_addr>       (?&display_name)? (?&angle_addr))
   (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
   (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                                          (?&CFWS)?)
   (?<display_name>    (?&phrase))
   (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

   (?<addr_spec>       (?&local_part) \@ (?&domain))
   (?<local_part>      (?&dot_atom) | (?&quoted_string))
   (?<domain>          (?&dot_atom) | (?&domain_literal))
   (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                 \] (?&CFWS)?)
   (?<dcontent>        (?&dtext) | (?&quoted_pair))
   (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

   (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&\'*+-\/=?^_`{|}~])
   (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
   (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
   (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

   (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
   (?<quoted_pair>     \\ (?&text))

   (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
   (?<qcontent>        (?&qtext) | (?&quoted_pair))
   (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                        (?&FWS)? (?&DQUOTE) (?&CFWS)?)

   (?<word>            (?&atom) | (?&quoted_string))
   (?<phrase>          (?&word)+)

   # Folding white space
   (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
   (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
   (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
   (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
   (?<CFWS>            (?: (?&FWS)? (?&comment))*
                       (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

   # No whitespace control
   (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

   (?<ALPHA>           [A-Za-z])
   (?<DIGIT>           [0-9])
   (?<CRLF>            \x0d \x0a)
   (?<DQUOTE>          ")
   (?<WSP>             [\x20\x09])
 )

 (?&address)/x';

    public function setConfig (array $config = array ())
    {
        $config = array_merge ($config, array ('needle' => self::PATTERN));
        return (parent::setConfig ($config));
    }

    public function isValid ()
    {
        return ((is_null ($this -> getData ()))
            || (parent::isValid ()));
    }
}

I also built a PHPUnit test that runs this class against a set of valid and invalid email addresses culled from several sources (mostly Wikipedia).
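A table-driven check of this kind can be sketched in plain PHP, without PHPUnit. This is only an illustration: `findMismatches` and the stand-in validator `$exactlyOneAt` are hypothetical names, not part of the class above, and the stand-in is deliberately naive.

```php
<?php
// Hypothetical helper: returns every case where the validator's verdict
// differs from the expected one, keyed by address.
function findMismatches($validator, array $cases)
{
    $mismatches = array();
    foreach ($cases as $address => $expected) {
        $actual = (bool) call_user_func($validator, $address);
        if ($actual !== $expected) {
            $mismatches[$address] = $actual;
        }
    }
    return $mismatches;
}

// Stand-in validator for illustration only: requires exactly one "@".
$exactlyOneAt = function ($address) {
    return substr_count($address, '@') === 1;
};

// Address => whether it is expected to be valid.
$cases = array(
    'bob@example.com'   => true,
    'A@b@c@example.com' => false,
    'no-at-sign'        => false,
);

$mismatches = findMismatches($exactlyOneAt, $cases);
```

Swapping in the real validator for `$exactlyOneAt` would list exactly the addresses whose result disagrees with the test data.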

The class seems to work for the more mundane cases, but it passes some addresses that are supposed to be invalid and fails some that are supposed to be valid. I've listed them below:

  • much."more\ unusual"@example.com (Fails, supposed to be valid)
  • "(),:;<>[\]@example.com (Passes, supposed to be invalid)
  • just"not"right@example.com (Passes, supposed to be invalid)
  • A@b@c@example.com (Passes, supposed to be invalid)
  • this\ is\"really\"not\\allowed@example.com (Passes, supposed to be invalid)

PHP seems to parse the regex correctly: it doesn't emit any errors, warnings or notices. All my other test cases (7 other valid addresses and 2 other invalid) also pass or fail as they should, so I doubt the problem is that my version of PHP (5.3.8) doesn't support the regex syntax being used here. But as I've got both false positives and false negatives, there's obviously something wrong. Either my test data is incorrect (as I said, I mostly culled it from Wikipedia), or the regex as entered is incorrect in some way.

Is the regex as entered above correct? If not, what corrections need to be made? If it is correct, then is there something wrong with my test cases?

EDIT: I also forgot to mention, as this is a validation class it needs to only pass strings that contain an email address and nothing else. I don't want to pass strings that contain a valid email address within non-email address data. I know you do that by using ^pattern_goes_here$ but this regular expression is rather more advanced than most I've worked with in the past, and I'm not sure where the ^ and $ should go. If you could also help with that I'd appreciate it.

GordonM
  • you can't use: `filter_var('bob@example.com', FILTER_VALIDATE_EMAIL)`? – Book Of Zeus Dec 17 '11 at 21:55
  • Please note that a regex email validator is VERY hard to make since the spec is very big! I don't think there is a single regex that matches all possible cases. – PeeHaa Dec 17 '11 at 21:59
  • @GordonM: regarding your edit: just place them right after and right before the delimiters, as you would normally do. – PeeHaa Dec 17 '11 at 22:03
  • @Book Of Zeus: FILTER_VALIDATE_EMAIL fails two addresses that should pass (Abc\@def@example.com and very."(),:;<>[]".VERY."very\\\ \@\"very".unusual@strange.example.com). – GordonM Dec 17 '11 at 22:07
  • @GordonM that is very interesting, i never thought that would pass, thanks for the info. maybe you can simply remove all the non-alpha (except the dots, underscore, dash) and then perform a validation? – Book Of Zeus Dec 17 '11 at 22:09
  • @Book Of Zeus: yeah, the thought did cross my mind, but I suspect that would result in a disproportionate number of false positives (addresses that should fail being passed). – GordonM Dec 17 '11 at 22:12
  • I think if someone came at my app with those "valid" email addresses, I would be comfortable ignoring them... `:P` – Jared Farrish Dec 17 '11 at 22:14
  • @GordonM you got a point there! – Book Of Zeus Dec 17 '11 at 22:15

2 Answers

Fully validating email addresses is a very tricky business.

Here's a list, complete with tests, that shows different ways to tackle it, but none of them will pass all cases.

http://fightingforalostcause.net/misc/2006/compare-email-regex.php

The expression with the best score is currently the one used by PHP's filter_var(), which is based on a regex by Michael Rushton.

I strongly suggest you use filter_var().
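A minimal sketch of that suggestion, using only PHP's built-in filter extension. The wrapper name `isEmail` is hypothetical; `filter_var()` with `FILTER_VALIDATE_EMAIL` returns the input string on success and `false` on failure, so the wrapper just converts that to a boolean.

```php
<?php
// Hypothetical wrapper around the built-in validator.
function isEmail($value)
{
    // filter_var() returns the filtered string on success, false on failure.
    return filter_var($value, FILTER_VALIDATE_EMAIL) !== false;
}

$plain    = isEmail('bob@example.com');         // ordinary address
$noAt     = isEmail('not-an-address');          // no "@" at all
$embedded = isEmail('see bob@example.com now'); // address inside other text
```

The filter checks the whole string, so an address embedded in other text is rejected without any extra anchoring.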

adlawson
  • I'm going to go with this solution on the simple grounds that it's numerically more successful with my test suite. However, it still fails a couple of tests it should pass. – GordonM Dec 18 '11 at 07:07

If you want to add ^ and $ anchors, this would be the place:

  ^(?&address)$  /x';

You also need to verify the sources of your email test cases. I would trust those regex subroutines more, as someone wrote them by translating the BNF declarations from the RFC.
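The effect of the anchors can be sketched with a much smaller grammar. The pattern below is a simplified stand-in, not the RFC 5322 regex from the question: it only allows letters, digits and dots, and the variable names are hypothetical. It shows that the anchors go around the final subroutine call, after the `(?(DEFINE)...)` block, and why an unanchored pattern is likely to pass strings such as `A@b@c@example.com`: `preg_match()` succeeds as soon as any substring matches.

```php
<?php
// Simplified stand-in grammar (assumption: NOT the full RFC 5322 rules).
$define = '(?(DEFINE)
   (?<atext>          [A-Za-z0-9])
   (?<dot_atom_text>  (?&atext)+ (?: \. (?&atext)+)*)
   (?<addr_spec>      (?&dot_atom_text) @ (?&dot_atom_text))
 )';

// Unanchored: succeeds if an address occurs anywhere in the subject.
$unanchored = '/' . $define . ' (?&addr_spec) /x';

// Anchored: the whole subject must be an address.
$anchored = '/' . $define . ' ^ (?&addr_spec) $ /x';

$subjects = array('bob@example.com', 'A@b@c@example.com', 'see bob@example.com now');

$results = array();
foreach ($subjects as $subject) {
    $results[$subject] = array(
        'unanchored' => preg_match($unanchored, $subject),
        'anchored'   => preg_match($anchored, $subject),
    );
}
```

Note that `$` also matches just before a trailing newline; adding the `D` modifier or using `\z` instead makes the end anchor strict.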

mario
  • I guessed it would be something like that, thanks for confirming. Adding the line anchors results in 3 false negatives (addresses that should pass but fail instead), but only one false positive (the empty string passes when it should fail). I think I'm going to go with the other solution though, simply because it has only 2 false negatives and no false positives against my test data. That said, the test data could be suspect; I only have the source's word regarding how valid those addresses are. If you know a reliable source of test data I'd be grateful for it. – GordonM Dec 18 '11 at 07:06
  • You could also try `^(?&mailbox)$ /x` as an alternative, which is probably more restrictive. No idea about the empty string passing, however. -- But the builtin filter_var regex seems the most adequate to me too. – mario Dec 18 '11 at 07:09