Does anyone know a regex that validates email addresses according to RFC5321/RFC5322?

Since (nestable) comments make the grammar irregular, only addresses without comments should be regarded.

Of course, if you're interested in validating an address that is actually owned by someone then the only real validation is to send an email to the address and check if the owner received it. I am however purely interested in the RFC standards. For a practical approach this question is more relevant.

On top of comments I am willing to sacrifice folding white space, but apart from that I'm not interested in expressions that reject any addresses that are RFC5321/2-valid. (Arguably it would even make sense in some circumstances to disregard folding white space.)

Ideally the regex would reject anything that's not RFC-valid, but that's less important. It's not so interesting to include an exhaustive list of top-level domains in the regex, for example. Simply accepting any top-level domain will suffice.

I'm not sure if address tags (e.g. address+tag@domain.org) are part of the RFCs I mentioned, but I would like the regex to validate these.

IPv6 should definitely be handled correctly (RFC5952).

As I understand it, internationalized email (RFC6530, RFC6531, RFC6532, RFC6533) is still in the experimental phase, but an expression validating these addresses would also be interesting.

To make the answers universally interesting it would be nice if any regular expressions were in POSIX format.

Community
Rinke
  • That's impossible with traditional regex flavours. Email addresses can contain comments with arbitrarily deep nesting, and such is not parsable by a regular expression grammar. – Bergi Dec 21 '12 at 15:21
  • @Bergi - True (and very good point). But if the (possibly nested) comments are first stripped out, then it can be done. This is how the perl regex solution linked to by Rafał Toboła does it. – ridgerunner Dec 21 '12 at 17:26
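
The strip-then-match idea from these comments can be sketched in Python as follows. This is not the Perl solution mentioned above, only a minimal illustration: `strip_comments` is a hypothetical helper that tracks nesting depth with a counter, leaves parentheses inside quoted strings alone, and treats a backslash as the start of a quoted-pair. It assumes the input is a single address and does not clean up any whitespace left around removed comments.

```python
def strip_comments(address: str) -> str:
    """Remove (possibly nested) parenthesised comments from an address.

    Hypothetical helper for illustration; raises ValueError when a comment
    or a quoted string is left unterminated.
    """
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted string (only possible at depth 0)
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == "\\" and i + 1 < len(address):
            # Quoted-pair: the next character is escaped. Keep the pair
            # only when we are outside every comment.
            if depth == 0:
                out.append(address[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0 or in_quotes:
        raise ValueError("unbalanced comment or quoted string")
    return "".join(out)
```

The output of such a helper is what a comment-free regex would then be applied to.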

1 Answer

Nestable comments make the grammar for email-addresses irregular (context-free). If you preclude comments however, the resulting grammar is regular. The primary definition allows for (folding) whitespace between lexical tokens (e.g. a @ b.com). Removing all folding whitespace results in a canonical form.

This is the regex for canonical email addresses according to RFC 5322 (precluding comments):

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])
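
As a usage illustration, here is a minimal Python sketch that applies the regex above unchanged. Python's `re` module is not a POSIX ERE engine, so this assumes (as is the case for this pattern) that it reads the bracket expressions the same way, with `\t` meaning a tab. The names `CANONICAL` and `is_canonical_address` are made up for the example, and `fullmatch` is used so the whole string has to match.

```python
import re

# Canonical-form regex copied verbatim from the answer above. A `]` placed
# first inside a bracket expression is taken literally by Python's `re`.
CANONICAL = re.compile(
    r"""([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])"""
)


def is_canonical_address(candidate: str) -> bool:
    """Return True if the whole string matches the canonical regex."""
    return CANONICAL.fullmatch(candidate) is not None


if __name__ == "__main__":
    for sample in ["address+tag@domain.org", "test@test", "a..b@example.com"]:
        print(sample, is_canonical_address(sample))
```

Note that an address with whitespace around the `@` (such as `a @ b.com`) is rejected here, since this regex only covers the canonical form.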

If you need to accept folding whitespace, then this is the regular expression for email addresses according to RFC 5322 (precluding comments):

((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?)

Valid email addresses are further restricted in RFC 5321 (SMTP). It basically leaves alone the part before the @-sign, but accepts only host names or address literals after the @-sign. ("---.---" is a valid dot-atom, but not a valid host name and "[...]" is a valid domain literal, but not a valid address literal.)

The grammar presented in RFC 5321 is too lenient when it comes to both host names and IP addresses. I took the liberty of "correcting" the rules in question, using this draft and RFC 1034 (section 3.5) as guidelines. Here's the resulting regex.

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?(\.[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?)*|\[((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}|IPv6:((((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){6}|::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){5}|[0-9A-Fa-f]{0,4}::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){4}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):)?(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){3}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,2}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){2}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,3}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,4}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,5}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,6}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)|(?!IPv6:)[0-9A-Za-z-]*[0-9A-Za-z]:[!-Z^-~]+)])
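
To make the dot-atom versus host name distinction concrete, here is a small Python sketch that lifts just the two domain sub-expressions out of the regexes in this answer and compares them. The names `DOT_ATOM`, `HOST_NAME`, `is_dot_atom` and `is_host_name` are invented for the example; address literals are left out.

```python
import re

# RFC 5322 dot-atom, as used for the domain part in the first regex.
DOT_ATOM = re.compile(r"[!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*")

# "Corrected" host name from the last regex: each label is 1 to 63
# characters long, alphanumeric at both ends, hyphens only in between.
HOST_NAME = re.compile(
    r"[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?"
    r"(\.[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?)*"
)


def is_dot_atom(domain: str) -> bool:
    """True if the whole string is a dot-atom."""
    return DOT_ATOM.fullmatch(domain) is not None


def is_host_name(domain: str) -> bool:
    """True if the whole string is a host name under the stricter rule."""
    return HOST_NAME.fullmatch(domain) is not None


if __name__ == "__main__":
    for domain in ["example.com", "---.---", "-bad.example"]:
        print(domain, is_dot_atom(domain), is_host_name(domain))
```

With these two patterns, `---.---` passes the dot-atom check but fails the host name check, which is the case described above.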

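The first two regexes are POSIX EREs. The last one uses a negative lookahead (`(?!IPv6:)`), which POSIX ERE does not support, so it needs an engine with lookahead support (e.g. PCRE). See here for the derivations of the regular expressions.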

Community
Rinke
  • These regexps are not compliant with RFC 6532, because they restrict the contact part to ASCII. – Mihail Krivushin May 17 '18 at 08:08
  • @MihailKrivushin Couldn’t agree more. The question was about RFC5321/2 specifically though... – Rinke May 17 '18 at 08:19
  • Why is there no `a-z` in the character groups in the first regex? And what characters does the `^-~` include? Is that range wanted? – mxmlnkn May 02 '19 at 20:31
  • @mxmlnkn The `a-z` range is included in `^-~`. If you search for an ASCII table you can see which characters are included in the ranges. – Rinke Jul 26 '19 at 15:20
  • This throws an empty character class warning from ESLint - https://eslint.org/docs/rules/no-empty-character-class – LonelyCpp Dec 19 '19 at 06:11
  • The first regex is not valid as it accepts "test@test" as valid. – Tomislav Brabec Jul 21 '23 at 13:33
  • @TomislavBrabec That's because "test@test" is a valid address according to the RFC. See [this answer](https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression/14075810#14075810) for more about extra validations on top of the RFCs. – Rinke Aug 10 '23 at 09:59
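
For the `^-~` question raised in the comments, a quick way to see what the range covers is to enumerate it by ASCII code point. This is a small Python sketch; `range_chars` is a name made up for the example.

```python
def range_chars(first: str, last: str) -> str:
    """Return every character from `first` to `last`, inclusive, by code point."""
    return "".join(chr(code) for code in range(ord(first), ord(last) + 1))


# The `^-~` range used in the character classes of the regexes above.
CARET_TO_TILDE = range_chars("^", "~")

if __name__ == "__main__":
    print(CARET_TO_TILDE)
```

The range runs from code point 94 (`^`) to 126 (`~`), so it contains all lowercase letters along with `^`, `_`, the backtick, `{`, `|`, `}` and `~`.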