What are the valid characters in the domain part of an e-mail address?

Question

Intention

I'm trying to do some very minimal validation of e-mail addresses, despite seeing a lot of advice against doing that. The reason is that the spec I am implementing requires e-mail addresses to be in this format:

mailto:<uri-encoded local part>@<domain part>

I'd like to simply split on the starting mailto: and the final @, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.

I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.

Problem

From everything I've read, splitting on the @ seems risky to me.

I've seen a lot of conflicting advice on the web and in Stack Overflow answers, most of it saying "read the RFCs", and some of it saying that the domain part can only contain certain characters, i.e. 0-9 a-z A-Z -., maybe a couple of other characters, but not much more than this. E.g.:

What characters are allowed in an email address?

When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) are allowed, which implies @ symbols are allowed. This is further compounded because "comments" are allowed in parens ( ) and can contain characters between ASCII 42 and 91 which include @.
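As a quick check of that reading, here is a minimal sketch that tests whether `@` falls inside the `dtext` and `ctext` ranges from RFC 5322 (obsolete forms left out). The names `DTEXT_RANGES`, `CTEXT_RANGES`, and `in_ranges` are made up for illustration. Note that `dtext` only applies inside a bracketed domain literal such as `[192.0.2.1]`, and `ctext` only inside a parenthesized comment.

```python
# Decimal ranges taken from the RFC 5322 grammar, obsolete forms omitted:
#   dtext = %d33-90 / %d94-126
#   ctext = %d33-39 / %d42-91 / %d93-126
DTEXT_RANGES = [(33, 90), (94, 126)]
CTEXT_RANGES = [(33, 39), (42, 91), (93, 126)]


def in_ranges(ch, ranges):
    """Return True if the character's code point lies in any (low, high) range."""
    code = ord(ch)
    return any(low <= code <= high for low, high in ranges)


# "@" is code point 64, which both sets of ranges cover.
print(in_ranges("@", DTEXT_RANGES), in_ranges("@", CTEXT_RANGES))
```

So, going by the grammar alone, an `@` can appear inside a domain literal or a comment, even though it cannot appear in an ordinary dotted domain name.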

RFC1035 seems to support the letters+digits+dashes+periods requirement, but "domain literal" syntax in RFC5322 seems to allow more characters.

Am I misunderstanding the RFC, or is there something I'm missing that disallows a @ in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?

Note that there is a bug in submission with `mailto:...` lines embedded in code, so please ignore the whitespace in that code line describing the format of my input. See - http://meta.stackexchange.com/questions/183642 — Merlyn Morgan-Graham, Jun 08 '13 at 18:05

score 2 · Accepted Answer · edited Oct 07 '21 at 06:06

The most recent RFC for email on the internet is RFC 5322 and it specifically addresses addresses.

addr-spec       =   local-part "@" domain
local-part      =   dot-atom / quoted-string / obs-local-part

The dot-atom is a highly restricted set of characters defined in the spec. However, the quoted-string is where you can run into trouble. It's not often used, but in terms of the possibility that you'll run into it, you could well get something in quotation marks that could itself contain an @ character.

However, if you split the string at the last @, you should safely have located the local-part and the domain, and the domain is well defined in the specification in terms of how you can verify it.
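A minimal sketch of that split, assuming the `mailto:<uri-encoded local part>@<domain part>` format from the question. The function name `split_mailto`, the case-sensitive prefix match, and the reading of "URI encoded" as "only RFC 3986 unreserved characters or `%XX` escapes" are all assumptions, not something the asker's spec is known to require.

```python
import re

PREFIX = "mailto:"

# Assumption: "URI encoded" means every character is either an RFC 3986
# unreserved character or a %XX escape.
URI_ENCODED = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+")


def split_mailto(value):
    """Return (local_part, domain), or None if the basic shape is wrong."""
    if not value.startswith(PREFIX):
        return None
    # rpartition splits at the LAST "@", so an encoded or quoted "@" earlier
    # in the local part does not confuse the split.
    local, at, domain = value[len(PREFIX):].rpartition("@")
    if not at or not local or not domain:
        return None
    if not URI_ENCODED.fullmatch(local):
        return None
    return local, domain


print(split_mailto("mailto:john.doe@example.com"))
print(split_mailto("mailto:%22a%40b%22@example.com"))
```

The domain is returned unchecked here; how strictly to validate it is a separate decision.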

The problem comes with punycode, whereby almost any Unicode character can be mapped into a valid DNS name. If the system you are front-ending can understand and interpret punycode, then you have to handle almost anything that has valid Unicode characters in it. If you know you're not going to work with punycode, then you can use a more restricted set, generally letters, digits, and the hyphen character.
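Both options can be sketched roughly as follows. `is_ldh_domain` and `to_ascii_domain` are hypothetical helper names, and the length limits (63 characters per label, 253 overall) are the usual DNS limits rather than anything from the asker's spec. The conversion uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so treat it as an approximation.

```python
import re

# One DNS label in letters-digits-hyphen form: 1 to 63 characters,
# with no hyphen at the start or end.
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_domain(domain):
    """Restricted check: dot-separated labels of letters, digits, hyphens."""
    if not domain or len(domain) > 253:
        return False
    return all(LDH_LABEL.fullmatch(label) for label in domain.split("."))


def to_ascii_domain(domain):
    """Convert a Unicode domain to its punycode (ASCII) form, or None on failure."""
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None


print(is_ldh_domain("example.com"))
print(to_ascii_domain("bücher.example"))
```

If punycode has to be supported, one option is to run `to_ascii_domain` first and apply `is_ldh_domain` to the result; otherwise the restricted check can be applied directly.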

To quote the late, great Jon Postel: "TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others."

Side note on the local part: keep in mind, of course, that there are probably lots of systems on the internet that don't require strict adherence to the specs and therefore might allow things outside of the spec to work, due to the long-standing liberal-acceptance/conservative-transmission philosophy.

It's the domain part and splitting on the `@` I care most about, though your feedback and warnings are appreciated :) — Merlyn Morgan-Graham, Jun 08 '13 at 18:10
Too little sleep last night. Let me revise with a little more emphasis on what you actually asked... — gaige, Jun 08 '13 at 18:10
Close - [RFC1035 matches what many people say about letters+digits+hyphens+dots](http://tools.ietf.org/html/rfc1035#page-8) - but I am reading RFC5322 to say that any ["printable US-ASCII character"](http://tools.ietf.org/html/rfc5322#page-18) is allowed, within ranges that seem to include the `@` symbol, to support "domain literal" form. I am hoping I am just misunderstanding it though :) — Merlyn Morgan-Graham, Jun 08 '13 at 19:45
This is part-and-parcel of why I put the section in there about punycode. Theoretically, almost anything, including non-ascii, is valid in RFC5322 as a domain. However, when things go to the mailer, it has to be translated to something that's valid for DNS. So, basically it depends on what you're trying to verify and how forgiving you want to be, which is why most people don't do much verification. — gaige, Jun 08 '13 at 19:50
Fun. I might give up and just check for `^mailto:` and any `@` :) Thanks, and cheers! — Merlyn Morgan-Graham, Jun 08 '13 at 19:52
