Using the email and smtplib modules in Python 3.x, after a good amount of research, I can send emails with Unicode subjects, text bodies, and names (for both the sender and the recipients), which is awesome, but it won't let me send emails to addresses that themselves contain Unicode (or other non-ASCII) characters. It doesn't seem to be supported (if you look at the comments in email.utils it says as much: i.e. "The address MUST (per RFC) be ascii, so raise a UnicodeError if it isn't.") Any attempts to do it anyway (including, but not only, BCC recipients—in an effort to maybe bypass any message header limitations) have failed with one form of Unicode error or another. The comment doesn't say which RFC (I don't think they all specify that email addresses should use ASCII-only.)

Is there another way to do this, seeing as addresses like this are rumored to be able to exist in some places: úßerñame@dómain.com? I mean, are there other email modules that do support it?

If the premise of my question is incorrect, are email addresses intended to become ASCII-only for the whole world (despite how some of them are rumored to use other characters)?

I see this question for other languages, but not for Python.

Brōtsyorfuzthrāx
  • Email addresses aren't "intended to become ASCII-only". They were originally ASCII-only, and they're intended to become more friendly to the rest of the world, but it's been a slow transition. – abarnert Sep 02 '18 at 03:34
  • Well, `dómain.com` is just `xn--dmain-0ta.com` per RFC 3492. – Dima Chubarov Sep 02 '18 at 03:43
  • I've discovered that if I set my server object's command_encoding (smtplib.SMTP.command_encoding) to "utf-8" (it's set to "ascii"), it'll act like it's sending BCC emails with special characters. Of course, I don't know if they'll be received or not, but it sends. – Brōtsyorfuzthrāx Sep 02 '18 at 03:50
  • I don't know if it's smtplib or Gmail's SMTP server, but after modifying email.utils.formataddr to accept UTF-8 email addresses, it usually says it sends the emails, but looking in my sent folder on Gmail, it doesn't appear to be sending the emails to the UTF-8 addresses, and in one case I get the following exception: `smtplib.SMTPRecipientsRefused` (I'm not sure if that's due to my SMTP server or because UTF-8 really isn't supported on `smtplib`). – Brōtsyorfuzthrāx Sep 02 '18 at 04:25

1 Answer

are email addresses intended to become ASCII-only for the whole world?

No; in fact, the exact opposite. Email addresses were ASCII-only. They're intended to become Unicode, and we're on the way there; it's just been a slow transition.


In modern email, there are two parts to an email address:[1] the DNS hostname (the part after the @), and the mailbox on that host (the part before the @). They're governed by entirely different standards, because DNS has to work for HTTP and all kinds of other things besides just email.


DNS was last updated back in 1987 in RFC 1035, which mandates a restricted subset of ASCII (and also case-insensitivity).

However, IDNA (Internationalized Domain Names for Applications), specified in RFC 5890, allows applications to optionally map a much larger part of the Unicode character set to DNS names for presentation to the user.

So, you cannot have the domain name dómain.com. But you can have the domain name xn--dmain-0ta.com. And many applications will accept dómain.com from user input and translate it automatically, and accept xn--dmain-0ta.com from the network and display it as dómain.com.[2]

In Python, some libraries for internet protocols will automatically IDNA-encode domain names for you; others will not. If they don't, you can do it manually, like this:

>>> 'dómain.com'.encode('idna')
b'xn--dmain-0ta.com'

Notice that in 3.x, this is a bytes, not a str; if you need a str, you can always do this:

>>> 'dómain.com'.encode('idna').decode('ascii')
'xn--dmain-0ta.com'
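
To apply the same conversion to a whole address rather than a bare hostname, something like the following sketch might do. The helper names `idna_encode_domain` and `idna_decode_domain` are made up for illustration; they only touch the part after the last @ and leave the mailbox part alone. Keep in mind that Python's built-in idna codec follows the older RFC 3490 rules, so treat this as an approximation rather than a full IDNA implementation.

```python
def idna_encode_domain(address):
    """Return address with only its domain part IDNA-encoded (hypothetical helper)."""
    local, at, domain = address.rpartition('@')
    if not at:
        raise ValueError('no @ in address: %r' % (address,))
    # The mailbox part is left as-is; IDNA only applies to the hostname.
    return local + '@' + domain.encode('idna').decode('ascii')


def idna_decode_domain(address):
    """Return address with its domain part decoded for display (hypothetical helper)."""
    local, at, domain = address.rpartition('@')
    if not at:
        raise ValueError('no @ in address: %r' % (address,))
    return local + '@' + domain.encode('ascii').decode('idna')


for_the_wire = idna_encode_domain('úßerñame@dómain.com')
for_display = idna_decode_domain('ussernyame@xn--dmain-0ta.com')
```

Here `for_the_wire` keeps the Unicode mailbox name but carries the xn-- form of the domain, and `for_display` goes the other way.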

Mailbox names are governed by the SMTP standards, most recently RFC 5321 and RFC 5322, which make it clear that it's entirely up to the receiving host how to interpret the "local part" of an address. For example, most email servers use case-insensitive names; many allow "plus-tagging" (so, e.g., shule@gmail.com and shule+so@gmail.com are the same mailbox); some (like Gmail) ignore all dots; etc.

The problem is that SMTP has never specified what character set is in use for the headers. Traditional SMTP servers were 7-bit ASCII only, so, practically, until recently, you could only use ASCII in the headers, and therefore in the mailbox names.

EAI (Email Address Internationalization), as specified in RFC 6530 and related proposals, allows negotiating UTF-8 in SMTP sessions. In a UTF-8 session, the headers, and the addresses in those headers, are interpreted as UTF-8. (IDNA-encoding of the hostname is not required but still allowed.)
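
To make "headers interpreted as UTF-8" concrete, here is a rough sketch using the standard library's email package. It assumes Python 3.5 or later, and the sender address and subject are placeholders of mine. Serializing with `email.policy.SMTPUTF8` should leave the non-ASCII header text as raw UTF-8 bytes rather than wrapping it in RFC 2047 encoded words.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg['From'] = 'ussernyame@example.com'   # placeholder sender
msg['To'] = 'úßerñame@dómain.com'
msg['Subject'] = 'Grüße'
msg.set_content('Hello')

# policy.SMTPUTF8 is the SMTP policy with utf8=True, so headers
# (addresses included) are written out as UTF-8 bytes.
raw = msg.as_bytes(policy=policy.SMTPUTF8)
```

Bytes like `raw` are what you would hand to `sendmail` together with the SMTPUTF8 option. As far as I can tell from the smtplib documentation, `SMTP.send_message` does this re-serialization itself when it finds non-ASCII addresses, so the manual step mainly matters when you build the bytes yourself.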

That's great, but what if your client, your server, your recipient's server, or any relaying servers along the way don't speak SMTPUTF8? To handle that case, everyone who has a UTF-8 mailbox also has an ASCII name for that mailbox. Ideally that gets sent along with the message, and the last SMTPUTF8 program on the chain switches to the ASCII substitute when it meets the first non-SMTPUTF8 program. More commonly, it just gets an error message and propagates it back to the user to deal with manually.[3]

The idea is that eventually, most hosts on the internet will speak SMTPUTF8, so you can be úßerñame@dómain.com—but meanwhile, your server on dómain.com has úßerñame and ussernyame as aliases to the same mailbox. Anyone who can't handle SMTPUTF8 will see you (and have to refer to you) as ussernyame. (Their mail client will, in fact, see you as ussernyame@xn--dmain-0ta.com, but it can fix that last part; there's nothing it can do about the first part if it was lost in transport.)

As of mid-2018, most hosts don't speak SMTPUTF8, and neither do many client libraries.

As of Python 3.5,[4] the standard library's smtplib supports SMTPUTF8. If you're using the high-level sendmail function:

If SMTPUTF8 is included in mail_options, and the server supports it, from_addr and to_addrs may contain non-ASCII characters.

So, what you do is something like this:

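import smtplib

try:
    # from_addr is a single string; to_addrs is a list of strings.
    server.sendmail(fromaddr, [toaddr], msg, mail_options=['SMTPUTF8'])
except smtplib.SMTPNotSupportedError:
    # The server did not advertise SMTPUTF8, so retry with the ASCII addresses.
    server.sendmail(fromaddr_ascii, [toaddr_ascii], msg)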

(In theory it's better to check the EHLO response with has_extn, but in practice, just trying it seems to work more smoothly. That may change with future improvements in the server ecosystem and/or smtplib.)
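
For completeness, the check-first variant could look roughly like this. `supports_smtputf8` and `choose_envelope` are names I made up; the only smtplib calls involved are `ehlo_or_helo_if_needed` and `has_extn`, and `server` is assumed to be an already connected `smtplib.SMTP` object.

```python
def supports_smtputf8(server):
    """Return True if the server advertised SMTPUTF8 in its EHLO reply."""
    # Sends EHLO (or HELO) only if that has not happened on this session yet.
    server.ehlo_or_helo_if_needed()
    return server.has_extn('smtputf8')


def choose_envelope(server, utf8_pair, ascii_pair):
    """Pick (from_addr, to_addr, mail_options) for this server.

    utf8_pair and ascii_pair are (from_addr, to_addr) tuples; both the
    function and its argument layout are illustrative assumptions.
    """
    if supports_smtputf8(server):
        return utf8_pair[0], utf8_pair[1], ['SMTPUTF8']
    return ascii_pair[0], ascii_pair[1], []
```

You would then pass the three returned values on as `server.sendmail(from_addr, [to_addr], msg, mail_options=options)`.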

Where do you get that fromaddr_ascii and toaddr_ascii? That's up to your program. The DNS part, you just use IDNA, but for the mailbox part, there is no such rule; you have to know the mailbox's alternate ASCII mailbox name. Maybe you ask the user. Maybe you have a database that stores contacts with both EAI and traditional addresses. Maybe you're only worried about one specific domain and you know that it uses some rule that you can implement.
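
As one possible shape for the "database of contacts" option, here is a small sketch. The `ASCII_ALIASES` table and the `ascii_fallback` function are hypothetical; in a real program the alias would come from wherever you store your contacts, since nothing can derive it from the Unicode name.

```python
# Hypothetical contact table: full EAI address -> ASCII alias of its mailbox part.
ASCII_ALIASES = {
    'úßerñame@dómain.com': 'ussernyame',
}


def ascii_fallback(address, aliases=ASCII_ALIASES):
    """Return an all-ASCII form of address, or raise LookupError."""
    local, at, domain = address.rpartition('@')
    if not at:
        raise ValueError('no @ in address: %r' % (address,))
    try:
        local.encode('ascii')
    except UnicodeEncodeError:
        # No rule exists for the mailbox part, so it must be looked up.
        if address not in aliases:
            raise LookupError('no ASCII alias known for %r' % (address,))
        local = aliases[address]
    # The domain part can always be converted mechanically with IDNA.
    return local + '@' + domain.encode('idna').decode('ascii')


toaddr_ascii = ascii_fallback('úßerñame@dómain.com')
```

An address that is already ASCII passes through unchanged, and an unknown Unicode mailbox raises LookupError so the caller can ask the user instead of guessing.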


1. Actually, there are two parts to an addr-spec; an address is an addr-spec plus optional display name and comment. But never mind that.

2. There are a few exceptions. For example, if you type http://staсkoverflow.com, your browser might warn you that the Cyrillic lowercase Es in place of a Latin lowercase Cee might be a hijacking attempt. Or, if you try to navigate to http://dómain.com, the error page telling you that the domain doesn't exist will probably show you xn--dmain-0ta.com, because that's more useful for debugging.

3. This is one of those things that will hopefully get better over time, but may well not get good enough until after it doesn't matter anymore anyway…

4. What if you're using Python 3.4 or 2.7? Then you don't have SMTPUTF8 support. Upgrade, go find a third-party library instead of smtplib, or write your own SMTP code.

abarnert
  • Do the domains in both the message header and toaddr/fromaddr need to be idna-style, or just the header? And do you still use the idna-style text on domains for where SMTPUTF8 is supported (if you don't need to, does it hurt)? – Brōtsyorfuzthrāx Sep 02 '18 at 05:15
  • @Shule The relevant spec is [section 3.2 of RFC 6531](https://tools.ietf.org/html/rfc6531#section-3.2): if your software is acting as a client, it "may transmit the domain parts of mailbox names within SMTP commands or the message header as A-labels or U-labels". In other words, you can use Unicode domains in both the commands and the headers without doing IDNA. However, you might want to do it anyway, because some servers may handle it better in practice. (For example, `exim` handles them if it's configured to, but IIRC, the version in Ubuntu 16.04 LTS was not.) – abarnert Sep 02 '18 at 08:52
  • @Shule And for the last part: it should never hurt delivery to IDNA the domain names even when you don't need to. However, it does mean that if the recipient's client software doesn't know IDNA, it's not going to show the addresses nicely. (Traditional client software that doesn't know IDNA probably doesn't know UTF-8 either, so better IDNA than mojibake… but webmail clients could be a different story.) – abarnert Sep 02 '18 at 08:55
  • Well, it seems to be sending now. I still need to try formatting the recipients, too, to see if that'll fix my message header problems in my Gmail sent folder, though (they only seem to be a problem with BCC, and only when I have both Unicode recipient names and Unicode email addresses rather than one or the other). I did have to use SMTPUTF8 and my amended version of email.utils.formataddr (but setting command_encoding doesn't seem necessary while using SMTPUTF8). I had to use IDNA domains for reply-to emails when sending to my Yandex email (to be able to reply to them from Yandex). – Brōtsyorfuzthrāx Sep 04 '18 at 13:41
  • Oh, I also used msg.add_header for all the email fields instead of assigning values like msg["To"]=addresses. That supports UTF-8 by default, although I don't remember how necessary that is when using formataddr in this case. – Brōtsyorfuzthrāx Sep 04 '18 at 13:43