9

I want to send email messages that have arbitrary unicode bodies in a Python 3.2 program. But, in reality, these messages will consist largely of 7bit ASCII text. So I would like the messages encoded in utf-8 using quoted-printable. So far, I've found this works, but it seems wrong:

c = email.charset.Charset('utf-8')
c.body_encoding = email.charset.QP
m = email.message.Message()
m.set_payload("My message with an '\u05d0' in it.".encode('utf-8').decode('iso8859-1'), c)

This results in an email message with exactly the right content:

To: someone@example.com
From: someone_else@example.com
Subject: This is a subjective subject.
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

My message with an '=D7=90' in it.

In particular, b'\xd7\x90'.decode('utf-8') results in the original Unicode character, so the quoted-printable encoding is properly rendering the utf-8. I'm well aware that this is an incredibly ugly hack. But it works.
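The same bytes can be checked without the email package at all. This is a minimal sketch, assuming only the standard-library quopri module, which operates on bytes rather than str. It keeps the two steps separate: first the character encoding (utf-8), then the content-transfer-encoding (quoted-printable).

```python
import quopri

text = "My message with an '\u05d0' in it."

# Step 1: character encoding (str -> UTF-8 bytes).
utf8_bytes = text.encode("utf-8")

# Step 2: content-transfer-encoding (UTF-8 bytes -> 7-bit-safe bytes).
qp_bytes = quopri.encodestring(utf8_bytes)
print(qp_bytes)

# Reversing both steps, in the opposite order, recovers the original text.
round_trip = quopri.decodestring(qp_bytes).decode("utf-8")
print(round_trip == text)
```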

This is Python 3. Text strings are expected to always be unicode. I shouldn't have to encode it to utf-8. And then turning it from bytes back into str by .decode('iso8859-1') is a horrible hack, and I shouldn't have to do that either.

Is the email module just broken with respect to encodings? Am I not getting something?

I've attempted to just plain old set it, with no character set. That leaves me with a unicode email message, and that's not right at all. I've also tried leaving off the encode and decode steps. If I leave them both off, it complains that the \u05d0 is out-of-range when trying to decide if that character needs to be quoted in the quoted-printable encoding. If I leave in just the encode step, it complains bitterly about how I'm passing in a bytes and it wants a str.

Omnifarious
  • If `"My message with an '\u05d0' in it."` is the unicode you desire, then you cannot use `"My message with an '\u05d0' in it.".encode('utf-8').decode('iso8859-1')`, since this is a different unicode string. (You will have altered the message.) – unutbu Feb 22 '12 at 21:52
  • @unutbu: Congratulations for spotting why the code is very ugly. But it works. It achieves the desired result. See my update. – Omnifarious Feb 22 '12 at 21:58
  • For Python 3.6+ see also now https://stackoverflow.com/questions/66039715/python3-email-message-to-disable-base64-and-remove-mime-version/66041936#66041936 – tripleee Feb 04 '21 at 08:33

2 Answers

10

The email package isn't confused about which is which (encoded unicode versus content-transfer-encoded binary data), but the documentation does not make it very clear, since much of the documentation dates from an era when "encoding" meant content-transfer-encoding. We're working on a better API that will make all this easier to grok (and better docs).

There actually is a way to get the email package to use QP for utf-8 bodies, but it isn't very well documented. You do it like this:

>>> from email import charset
>>> from email.mime.text import MIMEText
>>> charset.add_charset('utf-8', charset.QP, charset.QP)
>>> m = MIMEText("This is utf-8 text: á", _charset='utf-8')
>>> str(m)
'Content-Type: text/plain; charset="utf-8"\nMIME-Version: 1.0\nContent-Transfer-Encoding: quoted-printable\n\nThis is utf-8 text: =E1'
  • Thank you! This answers my question perfectly and gives me a way to do what I want that is not a disturbing hack. :-) – Omnifarious Mar 03 '12 at 01:43
  • That handles your character just fine. But it does not handle the character \u05d0. In fact, it doesn't encode your character as utf-8, it encodes it as iso8859-1. :-/ – Omnifarious Mar 03 '12 at 01:50
  • it fails for `'body …'`. It produces `'body =3DE2=3D80=3DA6'` instead of `'body=20=E2=80=A6'` in Python 3.3. And the same code fails on Python 3.4 with `UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 5: surrogates not allowed` – jfs Mar 09 '14 at 05:12
  • This is an excellent solution for Python 2.7, as it allows me to support unicode in my outbound emails and still allows me to include assertions in my test suite on the contents of email notifications (which is more difficult if base64 encoded). – jdhildeb Nov 01 '21 at 14:58
1

Running

import email
import email.charset
import email.message

c = email.charset.Charset('utf-8')
c.body_encoding = email.charset.QP
m = email.message.Message()
m.set_payload("My message with an '\u05d0' in it.", c)
print(m.as_string())

Yields this traceback message:

  File "/usr/lib/python3.2/email/quoprimime.py", line 81, in body_check
    return chr(octet) != _QUOPRI_BODY_MAP[octet]
KeyError: 1488

Since

In [11]: int('5d0',16)
Out[11]: 1488

it is clear that the unicode '\u05d0' is the problem character. _QUOPRI_BODY_MAP is defined in quoprimime.py by

_QUOPRI_HEADER_MAP = dict((c, '=%02X' % c) for c in range(256))
_QUOPRI_BODY_MAP = _QUOPRI_HEADER_MAP.copy()

This dict only contains keys from range(256), while the str passed to set_payload contains the code point 1488. So I think you are right; quoprimime.py as shipped with Python 3.2 cannot be used to encode arbitrary unicode.
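The lookup failure can be reproduced in isolation. This is a minimal sketch that rebuilds a dict with the same shape as _QUOPRI_BODY_MAP instead of importing the private one; the names qp_map, out_of_range and as_latin1 are illustrative only. It also shows why the encode/decode hack from the question avoids the KeyError: after the round trip through iso8859-1, every character stands for exactly one utf-8 byte.

```python
# Same shape as the map in Python 3.2's quoprimime.py: keys 0..255 only.
qp_map = {c: "=%02X" % c for c in range(256)}

text = "My message with an '\u05d0' in it."

# Characters whose code point has no entry in the map (these raise KeyError).
out_of_range = [ch for ch in text if ord(ch) not in qp_map]
print(out_of_range)

# The hack: every character now represents one UTF-8 byte, so all are < 256.
as_latin1 = text.encode("utf-8").decode("iso8859-1")
print(all(ord(ch) in qp_map for ch in as_latin1))
```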

As a workaround, you could use (the default) base64 by omitting

c.body_encoding = email.charset.QP
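A minimal sketch of that workaround, assuming a reasonably recent Python 3 (it has not been checked against 3.2 specifically). With body_encoding left at its default for utf-8, the body is base64 encoded and decodes back to the original text:

```python
import email.charset
import email.message

text = "My message with an '\u05d0' in it."

# Leave body_encoding alone: the default for utf-8 is base64.
c = email.charset.Charset("utf-8")
m = email.message.Message()
m.set_payload(text, c)
print(m.as_string())

# Undo the base64 transfer encoding, then the utf-8 character encoding.
decoded = m.get_payload(decode=True).decode("utf-8")
print(decoded == text)
```

The cost is the one already mentioned in the question: a mostly ASCII message is no longer readable in its raw form.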

Note that the latest version of quoprimime.py does not use _QUOPRI_BODY_MAP at all, so using the latest Python might fix the problem.
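For later readers: assuming Python 3.6 or newer, the email.message.EmailMessage API accepts the transfer encoding directly through the cte argument of set_content, so neither the Charset object nor the encode/decode hack is needed. A sketch, not tested on 3.2, where this API does not exist:

```python
from email.message import EmailMessage

text = "My message with an '\u05d0' in it."

msg = EmailMessage()
msg["To"] = "someone@example.com"
msg["From"] = "someone_else@example.com"
msg["Subject"] = "This is a subjective subject."

# Ask for utf-8 as the charset and quoted-printable as the transfer encoding.
msg.set_content(text, charset="utf-8", cte="quoted-printable")

# as_bytes() returns the wire form; the body contains '=D7=90'.
raw = msg.as_bytes()
print(raw.decode("ascii"))
```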

unutbu
  • I suspect it won't. The problem seems to be not properly converting to utf-8 bytes before applying the quoted-printable encoding. The `as_string` and `__str__` methods of `email.message.Message` should be deprecated in favor of methods that return bytes instead. I'm guessing the whole email package is a bit confused about the difference between the binary encoding done on an email message and the encoding implied by using a particular character encoding system. Those two are actually separate concepts even though they both use the term 'encoding'. – Omnifarious Feb 22 '12 at 22:11