12

Part of a website I am currently working on contains a registration process where users have to provide their email address. I recently became aware that non-ASCII domains are possible (and so are non-ASCII email addresses). My backend is a UTF-8 encoded MySQL database, and I expect that users with different locales should be able to enter their email addresses, but I don't know how to validate this kind of address.

Currently I am testing out jQuery Tools; it validates English (ASCII) email addresses correctly but fails to validate non-ASCII ones. I also need to do the same on the server side with PHP. Is there a regular expression that can validate this kind of email address?

I have tried the address below, but it fails in jQuery Tools (it is just an example for demonstration; I can't read it myself):

闪闪发光@闪闪发光.com

Also, what will happen when they type their English email address (jonesmith@somemail.com) with their own IME? Can this be validated with the regular expression we currently have for English email validation? For now I don't need to worry about whether the email address actually exists.

Thanks

Deepak Shrestha

7 Answers

15

Attempting to validate email addresses may not be a good idea. The specifications (RFC5321, RFC5322) allow for so much flexibility that validating them with regular expressions is literally impossible, and validating with a function is a great deal of work. The result of this is that most email validation schemes end up rejecting a large number of valid email addresses, much to the inconvenience of the users. (By far the most common example of this is not allowing the + character.)

It is more likely that the user will (accidentally or deliberately) enter an incorrect email address than an invalid one, so actually validating is a great deal of work for very little benefit, with possible costs if you do it incorrectly.

I would recommend that you just check for the presence of an @ character on the client and then send a confirmation email to verify it; it's the most practical way to validate and it confirms that the address is correct as well.
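As a minimal sketch of that client-side check (the function name `looksLikeEmail` is made up for illustration, and requiring at least one character on each side of the @ is an extra assumption on top of the bare presence test):

```javascript
// Hypothetical helper: a deliberately permissive pre-check.
// It only confirms there is an @ with something on both sides;
// the confirmation email does the real verification.
function looksLikeEmail(value) {
  const trimmed = value.trim();
  const at = trimmed.lastIndexOf('@');
  // at > 0: something before the @; at < length - 1: something after it.
  return at > 0 && at < trimmed.length - 1;
}

console.log(looksLikeEmail('闪闪发光@闪闪发光.com')); // true
console.log(looksLikeEmail('user+tag@example.com')); // true
console.log(looksLikeEmail('no-at-sign'));           // false
```

Because the check never inspects which characters are used, it treats ASCII and non-ASCII addresses the same way.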

Community
Jeremy
  • Thanks for the suggestion. I wanted to know whether mailers like sendmail or phpmail can handle these UTF-8 encoded email addresses out of the box, without any modification on my part. – Deepak Shrestha Mar 07 '11 at 13:03
  • 6
    While technically correct that validating an email with regex is _nearly_ impossible, I couldn't disagree more with this answer as a general solution. In most **real world** (non-theoretical) applications, you'd be storing the relevant email address in a database, and/or doing some manipulation on it in the future. Allowing any old UTF-8 string to pass unencumbered to the data layer is a **terrible** idea. I'd rather reject a few "off the wall" valid email addresses than have a 100% chance of a clever injection attack. In the real world, `"hi"\ ~e^ery!@myhost` won't come up too often. – s.co.tt Oct 31 '13 at 20:00
2

As suggested by Mario, after playing around a bit I came up with the following regex to validate non-standard email addresses:

^([\p{L}\_\.\-\d]+)@([\p{L}\-\.\d]+)((\.(\p{L}){2,63})+)$

It would validate any proper email address containing all kinds of Unicode letters, with the TLD length limited to between 2 and 63 characters.

Please check it and let me know if there are any flaws.

Example Online
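For the client side, the same idea can be sketched in JavaScript. This is an adaptation rather than the exact pattern above: engines that implement ES2018 accept `\p{L}` when the regex carries the `u` flag, but that flag rejects the `\_` escape, so the underscore is written unescaped here. The names `UNICODE_EMAIL` and `isPlausibleEmail` are made up for illustration.

```javascript
// Adapted pattern: Unicode letters, digits, underscore, dot and hyphen in
// the local part; a domain ending in one or more dot-separated labels of
// 2 to 63 letters. The "u" flag is required for \p{L} to be recognised.
const UNICODE_EMAIL = /^[\p{L}_.\-\d]+@[\p{L}\-.\d]+(\.\p{L}{2,63})+$/u;

// Hypothetical helper wrapping the pattern.
function isPlausibleEmail(value) {
  return UNICODE_EMAIL.test(value);
}

console.log(isPlausibleEmail('闪闪发光@闪闪发光.com'));   // true
console.log(isPlausibleEmail('jonesmith@somemail.com')); // true
console.log(isPlausibleEmail('user+tag@example.com'));   // false: "+" is not in the class
```

The last call shows the trade-off of any character whitelist: a valid address containing `+` is rejected unless the class is widened.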

Ilia Ross
  • It's valid for PHP, not for JavaScript. – D.A.H Aug 17 '14 at 20:42
  • 1
    @D.A.H JavaScript does not support Unicode shortcuts. You could use *Steven Levithan's XRegExp package with Unicode add-ons* - http://xregexp.com/plugins/. – Ilia Ross Aug 17 '14 at 22:10
  • What a nice email address! :-) Okay, I've updated the regex. Underscores are indeed allowed by many email providers. Thanks. – Ilia Ross Aug 06 '18 at 21:17
  • @IliaRostovtsev Sorry, didn't see your comment until now. Upvoted. Thanks! – Jeremy Nov 13 '18 at 19:07
  • Note for 2021: UTF-8 additions in PCRE (tested in preg_replace in PHP 7.3) may prefer \p{Pd} instead of \- for hyphens, and \p{Nd} instead of \d for decimal numbers if your code seems to fail after upgrading. – Jeff Clayton Oct 04 '21 at 17:11
2

Since version 5.2, PHP has built-in validation for email addresses, but I'm not sure whether it works for UTF-8 encoded strings:

echo filter_var($email, FILTER_VALIDATE_EMAIL);

In the original PHP source code you will find the regex used for validating email addresses; it can be used for manual validation when using PHP < 5.2.

Update

idn_to_ascii() can be used to "Convert domain name to IDNA ASCII form." The converted address can then be validated with filter_var($email, FILTER_VALIDATE_EMAIL);

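// International domains: convert the domain part to its ASCII (Punycode) form
if (function_exists('idn_to_ascii') && ($at = strrpos($email, '@')) !== false) {
    // Split at the last @, since a quoted local part may itself contain @
    $ascii_domain = idn_to_ascii(substr($email, $at + 1));
    if ($ascii_domain !== false) { // idn_to_ascii() returns false on failure
        $email = substr($email, 0, $at).'@'.$ascii_domain;
    }
}
// filter_var() returns the address on success and false on failure
$is_valid = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;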
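
On the client, a comparable domain conversion can be sketched with the standard URL parser, which applies IDNA processing to host names. This is an illustration under assumptions, not something jQuery Tools provides: `toAsciiEmail` is a hypothetical name, the local part is left untouched, and domains containing URL delimiters or whitespace are simply rejected.

```javascript
// Hypothetical helper: convert the domain part of an address to its
// ASCII (Punycode) form using the standard WHATWG URL parser.
function toAsciiEmail(email) {
  const at = email.lastIndexOf('@');
  if (at < 1 || at === email.length - 1) {
    return null; // no @, no local part, or no domain
  }
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  // Reject characters the URL parser would read as path, query, port, etc.
  if (/[\/\\?#:\s]/.test(domain)) {
    return null;
  }
  try {
    // hostname is returned lower-cased and IDNA-encoded (xn--...).
    const asciiDomain = new URL('http://' + domain).hostname;
    return local + '@' + asciiDomain;
  } catch (e) {
    return null; // the domain could not be parsed as a host name
  }
}

console.log(toAsciiEmail('jonesmith@somemail.com'));
console.log(toAsciiEmail('闪闪发光@闪闪发光.com'));
```

An ASCII domain passes through unchanged, while an internationalized one comes back with an `xn--` label that an ASCII-only validator can then inspect.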
powtac
0

A regex could be something like this:

[^ ]+@[^ ]+\.[^ ]{2,6}
powtac
  • 4
    There is nothing limiting TLDs to 2-6 characters, and given ICANN's decision to allow the creation of arbitrary ones it seems reasonable to assume that addresses such as `.microsoft` will be in use before too long. Also, it is possible for spaces to be included in valid email addresses if they are properly escaped. – Jeremy Mar 07 '11 at 13:00
  • 2
    No problem, extend the {2,6} to whatever you want. It could also be replaced by [^ ]. – powtac Mar 07 '11 at 13:07
  • Thanks for the info. Validation of this kind seems like a herculean task to me. – Deepak Shrestha Mar 07 '11 at 13:10
  • It is not a trivial question. Try to cover as much as you can with your regex. Check this link to see what the real regex would look like in Perl: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html – powtac Mar 07 '11 at 13:13
0

I got this idea from a JavaScript tutorial page. It is basic, but it works for me without my having to worry about the complexity of regular expressions and Unicode standards.

Client side validation

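// Reject empty or whitespace-only input
if (!$.trim(value).length) {
    return false;
}

// Declared with var so they do not leak as implicit globals
var atPos = value.indexOf("@");
var stopPos = value.lastIndexOf(".");

// Both an @ and a dot are required
if (atPos === -1 || stopPos === -1) {
    return false;
}

// The last dot must come after the @
if (stopPos < atPos) {
    return false;
}

// The dot must not directly follow the @ (as in "user@.com")
if (stopPos - atPos === 1) {
    return false;
}

return true;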

Serverside validation

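if (!isset($_POST['emailaddr']) || trim($_POST['emailaddr']) == "") {
    //Error: Email required
}
else {
    $atpos = strpos($_POST['emailaddr'], '@');
    // strrpos() finds the last dot, matching lastIndexOf() on the client;
    // strpos() would wrongly reject addresses like john.smith@example.com
    $stoppos = strrpos($_POST['emailaddr'], '.');

    if (($atpos === false) || ($stoppos === false)) {
        //Error: invalid email (no @ or no dot)
    }
    elseif ($stoppos < $atpos) {
        //Error: invalid email (no dot after the @)
    }
    elseif (($stoppos - $atpos) == 1) {
        //Error: invalid email (dot directly after the @)
    }
}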

Though it still has some loopholes, I guess users will not be fooling around with this stuff. Also, real validation is required for serious stuff, as suggested by 'Jeremy Banks'.

Hope this will be helpful for somebody else too.

Thanks and regards to all

Deepak Shrestha
-1

On this subject I liked this page so much that I set up a blog exposing sites that do validation wrong (contributions gratefully received - don't let yours be on it!).

As far as using regexes go, those that say "it's wrong", tend to be light on alternatives, and TBH validation to the last letter of the RFC isn't really that critical - for example while noddy+!#$%&'*-/=?+_{}|~test@gmail.com is a perfectly valid address, it's not too unreasonable to reject it given that a surprisingly large proportion of users can't even type 'hotmail' correctly. Some domains are also quite restrictive on user names anyway, particularly hotmail. So I'm in favour of regexes that are demonstrably reasonable, and my favourite source for that is this page, though I don't like their current JS 'winner' and it would help if they set up a public test page.

jQuery's validate plugin uses this regex which is interestingly constructed, quite similar in style (but smaller!) to the ex-parrot one (actually my ISP!) linked by @powtac .

Synchro
-3

What about something like this:

mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
mb_ereg('[\w]+@[\w]+\.com', $mail);
The Bndr
  • that regex doesn't really do any validation (will return false positives and false negatives) – symcbean Mar 07 '11 at 16:46
  • \w doesn't match . or - (which are valid characters for both domain and email) – Edson Medina Jan 08 '13 at 15:26
  • @EdsonMedina >all emails end with .com< That depends. This answer is more of an example. If you build a company-internal web page and need to validate the email address in order to allow only company-internal addresses, then this could be one way. Of course, a stricter email syntax is needed. – The Bndr Nov 20 '14 at 09:04