
Motivation

In this thread I would like to collect best practices and solutions for validating an email address, including international email addresses. There are a couple of approaches, such as structural checks and DNS lookups, but there are traps and edge cases along the way that not everybody knows about. I hope you can help me collect good links, code and tips, grouped by topic (e.g. server side, HTML preparation, ...).

Let's handle each area of interest in a separate answer.

Meaning of validation

When I use the term validation, I mean data validation. Wikipedia defines it as follows:

[...] is the process of ensuring data have undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful.

Source: https://en.wikipedia.org/wiki/Data_validation

Email address validation

Email address validation means testing whether a string is valid under the terms of RFC 5322, the latest version of the specification describing the Internet Message Format used by email. Reference: https://www.rfc-editor.org/rfc/rfc5322

That does not include checking whether the email provider is acceptable (e.g. disposable email services), whether the address makes sense (e.g. a1@a2.coo), or whether the TLD exists.

Not covered by most validators: International email addresses

An international email address (ref) can contain all kinds of characters that do not exist in ASCII, encoded as UTF-8.

Valid examples based on the wiki article:

  • θσερ@εχαμπλε.ψομ
  • Dörte@Sörensen.example.com
k00ni
  • This has already been asked and answered here several times before. If you don't think that those answers are good enough, you should add this as an answer to those questions instead of creating yet another duplicate. – M. Eriksson Feb 25 '19 at 13:37
  • Possible duplicate of [How to validate an email address in PHP](https://stackoverflow.com/questions/12026842/how-to-validate-an-email-address-in-php) – M. Eriksson Feb 25 '19 at 13:39
  • The problem with all these threads and posts is that it is hard to find useful solutions. They are either behind a link or in one of the comments under a post. My intention with this thread is to outline and reference known solutions, discuss problems to some extent and outline up-to-date solutions. Some known solutions were up to date 5 years ago. Much has changed since then, e.g. regarding PHP 5 => 7. – k00ni Feb 25 '19 at 13:39
  • _"The problem with all these threads and posts is, that it is hard to find useful solutions"_ - Sure, but adding even _more_ posts and threads on the same subject won't make it easier to find. – M. Eriksson Feb 25 '19 at 13:40
  • I realized it's more about international email addresses than classic ones. The answer below as well as the title were refined to better fit the question. It seems to me now that it is more like an extension of the current set of solutions (e.g. those posted on Stack Overflow), complemented to also handle all kinds of international characters (UTF-8) in email addresses. – k00ni Feb 25 '19 at 14:55
  • Basically this is about writing a fully RFC compliant address validator for international addresses? It's a pretty well established fact that it's very very difficult to write such a thing, since it's not even a given that everyone agrees on the same RFC, and those are mightily complex and partly contradictory. And that still won't tell you whether the email address actually exists, is in use, or belongs to the user you think it does. So in practice, you just do an email verification loop (you send an email). Not sure how this can be substantially improved… – deceze Mar 11 '19 at 11:01
  • `Basically this is about writing a fully RFC compliant address validator for international addresses?` No. In my case I rely on mail addresses being valid to some extent (e.g. structure: OK, domain exists: OK). It's rather "can I work with the address" than "is the address valid". I also read about sending SMTP commands [here](https://stackoverflow.com/a/19263515/5301527), but that is not reliable. I am open to suggestions on how to counter the problem. Maybe the title isn't as accurate as I thought, either. – k00ni Mar 11 '19 at 11:15

1 Answer


Not a duplicate: This answer collects known solutions for validating an email address. It also contains information about known limitations when checking international email addresses. At the end I provide a possible solution for handling international email addresses.

filter_var

The author of this post proposed the following function to validate an email address:

function isValidEmail($email){ 
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

If you require a TLD to be part of the address, the author also proposed:

function isValidEmail($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) 
        && preg_match('/@.+\./', $email);
}

Problem: No support for international email addresses

By default, filter_var does not cover international email addresses, i.e. addresses containing non-ASCII (UTF-8 encoded) characters such as Greek or Cyrillic letters. They are rejected as invalid.
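
A small sketch to reproduce this limitation. The two function names are made up for illustration. The second one uses FILTER_FLAG_EMAIL_UNICODE, which exists since PHP 7.1; as far as I can tell it only relaxes the local part (before the @), so a non-ASCII domain would still need to be converted to its ASCII form first. Please verify the behaviour against your PHP version before relying on it.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: plain filter_var check, ASCII addresses only.
function is_valid_email_default(string $email): bool
{
    return \filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

// Hypothetical helper: same check, but with the flag that permits
// Unicode characters in the local part (PHP >= 7.1).
function is_valid_email_unicode_local(string $email): bool
{
    return \filter_var($email, FILTER_VALIDATE_EMAIL, FILTER_FLAG_EMAIL_UNICODE) !== false;
}

\var_dump(is_valid_email_default('user@example.com'));       // plain ASCII address
\var_dump(is_valid_email_default('θσερ@εχαμπλε.ψομ'));        // international address
\var_dump(is_valid_email_unicode_local('Dörte@example.com')); // Unicode local part, ASCII domain
```

The first call should report true and the second false, which is exactly the limitation described above.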


preg_match

Use a custom regex to validate the structure. A good post with a detailed description is here.

The author proposed a regex from http://emailregex.com/, which checks an address against the latest RFC 5322. The following code is the non-fixed version:

$regex = '/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}@)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*@(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD';

if (1 == \preg_match($regex, $email)) {
   // email OK
}

He also mentioned:

[...] RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use. [...]

Problem: No support for international email addresses

This solution does not cover international addresses either: they simply do not match.


Optional: DNS lookup

A DNS lookup is not validation in itself, but it can complement the structural check. It also works for domains with non-ASCII (UTF-8) characters, as long as they form a valid internationalized domain name (Reference: https://en.wikipedia.org/wiki/Internationalized_domain_name).

[...] is an Internet domain name that contains at least one label that is displayed in software applications, [...], in a language-specific script or alphabet, such as Arabic, Chinese, Cyrillic, Tamil, Hebrew or the Latin alphabet-based characters with diacritics or ligatures, such as French.

With checkdnsrr you can check whether a given domain has a DNS record of a given type.

// $domain was extracted from the given email before
// $domain must end with a . (see comment below)

if (checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A') || checkdnsrr($domain, 'AAAA')) {
    // domain is VALID
}

User Martin mentioned at php.net that the domain must end with a . (trailing dot) to be considered valid. Without the trailing dot, you will get false positives.

Source: http://php.net/manual/en/function.checkdnsrr.php#119969
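
The snippet above assumes $domain was already extracted. Here is a minimal sketch of how that preparation step could look. The function name domain_for_dns_lookup is hypothetical, and the conversion of non-ASCII domains relies on idn_to_ascii from the intl extension, which I assume is installed. If it is not, the sketch gives up on non-ASCII domains instead of guessing.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns the domain of $email in a form suitable for
// checkdnsrr (ASCII, with a trailing dot), or null if that is not possible.
function domain_for_dns_lookup(string $email): ?string
{
    // Use the last @, because a quoted local part may itself contain an @.
    $at = \strrpos($email, '@');
    if ($at === false || $at === 0 || $at === \strlen($email) - 1) {
        return null; // no @, empty local part or empty domain
    }

    $domain = \substr($email, $at + 1);

    // Only non-ASCII domains need the IDN conversion.
    if (\preg_match('/[^\x00-\x7F]/', $domain) === 1) {
        if (!\function_exists('idn_to_ascii')) {
            return null; // intl extension missing (assumption: we do not guess)
        }
        $domain = \idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($domain === false) {
            return null; // not a valid internationalized domain name
        }
    }

    // Normalize to exactly one trailing dot (see Martin's comment above).
    $domain = \rtrim($domain, '.');

    return $domain === '' ? null : $domain . '.';
}
```

The result can then be passed to checkdnsrr as shown above, e.g. `$domain = domain_for_dns_lookup($email);` followed by the MX/A/AAAA checks if `$domain !== null`.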


Handle international emails

Possible solution 1: structural check + DNS lookup

From what I have seen so far, you need a combination of structural checks and a DNS lookup to get the best coverage. The first part of the following code is based on the class EmailAddress from Genkgo Mail ( source ).

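function mail_is_valid(string $address): bool {
    // Structural check: exactly one @ with a non-empty part on each side.
    $hits = \preg_match('/^([^@]+)@([^@]+)$/', $address, $matches);

    if ($hits !== 1) {
        // no match (0) or regex error (false): email NOT valid
        return false;
    }

    [, $localPart, $domain] = $matches;

    // Prefer UTS #46 where available; the 2003 variant is deprecated since PHP 7.2.
    $variant = \defined('INTL_IDNA_VARIANT_UTS46')
        ? INTL_IDNA_VARIANT_UTS46
        : INTL_IDNA_VARIANT_2003;

    // Convert a (possibly internationalized) domain to its ASCII form.
    $asciiDomain = \idn_to_ascii($domain, IDNA_DEFAULT, $variant);
    if ($asciiDomain === false) {
        // not a valid internationalized domain name
        return false;
    }

    // Ensure exactly one trailing dot (see the DNS lookup section above).
    $domain = \rtrim($asciiDomain, '.') . '.';

    // The domain counts as existing if it has an MX, A or AAAA record.
    return \checkdnsrr($domain, 'MX')
        || \checkdnsrr($domain, 'A')
        || \checkdnsrr($domain, 'AAAA');
}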

I consider this the best solution currently available, because the algorithm is mostly character-agnostic and therefore allows UTF-8 characters in the address. An address passes the structural check as long as it has the form user part + @ + domain part. The DNS lookup then ensures that the domain exists.
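
To illustrate the character-agnostic part in isolation (without any DNS traffic), here is a small sketch of the structural check only. The function name split_address is hypothetical. Unlike the regex in mail_is_valid, this variant adds the u modifier, which is my own addition: with it, preg_match fails on input that is not valid UTF-8, so such input is rejected as well.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: splits an address into [user part, domain part],
// or returns null if it does not have the form "user part @ domain part".
function split_address(string $address): ?array
{
    // The u modifier treats the subject as UTF-8; invalid UTF-8 makes
    // preg_match return false, which is handled like "no match" here.
    $hits = \preg_match('/^([^@]+)@([^@]+)$/u', $address, $matches);

    if ($hits !== 1) {
        return null;
    }

    return [$matches[1], $matches[2]];
}

\var_dump(split_address('θσερ@εχαμπλε.ψομ'));           // two parts: user part and domain part
\var_dump(split_address('Dörte@Sörensen.example.com')); // two parts as well
\var_dump(split_address('a@b@c'));                      // NULL, more than one @
```

Note that this only confirms the overall shape. It says nothing about whether the domain exists, which is what the DNS lookup is for.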

It's not optimal. If you know a better way, please post it as a comment or as an answer.

k00ni
  • This answer relies heavily on links which can be dead in just a few days for all we know. Make sure to include all parts of what is relevant from the links in the answer so that in case the links go dead your answer is still valid and useful. Just to be clear, not my downvote – Andreas Feb 25 '19 at 13:39
  • I added the relevant code parts from the links in the post. Thanks for the advice. – k00ni Feb 25 '19 at 13:52
  • I added a note on why this is not a duplicate. It also outlines known solutions and why they do not work for international domains, and ends with a possible solution for checking international email addresses. – k00ni Feb 25 '19 at 14:33