Incorrect decoding in Python3 email module

Question

I have recently run into an EML file I wanted to parse with Python email module. In from header, there was following text:

From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrp?=
=?utf-8?b?g6g=?=" <email@address.com>

So the name is encoded in 2 parts. When I concatenate the code and decode this manually to hex, I get the following result, which is correct UTF-8 string:

e5 bd ad e4 bb a5 e5 9b bd 2f e7 ac ac e4 ba 8c e4 ba 8b e4 b8 9a e9 83 a8 e9 a1 b9 e7 9b ae e9 83 a8 2f e7 ac ac e4 ba 8c e4 ba 8b e4 b8 9a e9 83 a8

However, when I call the Python email Parser parse, the last 3 bytes are not decoded correctly. Instead, when I read the values of message['from'], there are surrogates:

dce9:20:dc83:dca8

So when I, for example, want to print the string, it ends up with

UnicodeEncodeError('utf-8', '彭以国/第二事业部项目部/第二事业\udce9\udc83\udca8', 17, 18, 'surrogates not allowed')

When I join the 2 encoded parts in From header into one, which looks like this:

From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrpg6g=?=" <email@address.com>

The string is decoded correctly by the library and can be printed just fine.

Is this a bug inside Python email module? Is the double-encoded value even permitted by EML standard?

Here is a sample EML file + Python code to reproduce the bad decoding (this does not actually trigger the exception, which happens later i.e. with SQLAlchemy not being able to encode the string back to UTF-8)

EML:

Content-Type: multipart/mixed; boundary="===============2193163039290138103=="
MIME-Version: 1.0
Date: Wed, 25 Aug 2018 19:21:23 +0100
From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrp?=
 =?utf-8?b?g6g=?=" <addr@addr.com>
Message-Id: <12312924463694945698.525C0AC435BA7D0E@xxxxx.com>
Subject: Sample subject
To: addr@addr.com

--===============2193163039290138103==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

VGhpcyBpcyBhIHNhbXBsZSB0ZXh0

--===============2193163039290138103==--

Python code:

from email.parser import Parser
from email import policy
from sys import argv


with open(argv[1], 'r', encoding='utf-8') as eml_file:
    msg = Parser(policy=policy.default).parse(eml_file)

print(msg['from'])

Result:

彭以国/第二事业部项目部/第二事业� ��

Martijn Pieters · Accepted Answer · 2018-12-21T11:39:29.347

This appears to be a problem with how the email.parser infrastructure is handling unfolding of multi-line headers containing encoded-word tokens for the From header and other structured headers. It does this correctly for unstructured headers such as Subject.

Your header has two encoded word parts, on two separate lines. This is perfectly normal, an encoded-word token has limited space (there is a maximum length limit) and so your UTF-8 data was split into two such words, and there is a line-separator plus space in-between. All great and fine. Whatever generated the email was wrong to split in the middle of a UTF-8 character (RFC2047 states that is strictly forbidden), a decoder of such data should not insert spaces between the decoded bytes. It is the extra space that then prevents the email header handling from joining the surrogates and repairing the data.

So this appears to be a bug in the way the headers are parsed when handling structured headers; the parser does not correctly handle spaces between encoded words, here the space was introduced by the folded header line. This then results in the space being preserved in between the two encoded-word parts, preventing proper decoding. So while RFC2047 does state that encoded-word sections MUST contain whole characters (multi-byte encodings must not be split), it also states that encoded words can be split up with CRLF SPACE delimiters and any spaces in between encoded words are to be ignored.

You can work around this by supplying a custom policy class, which removes the leading white space from lines in your own implementation of the Policy.header_fetch_parse() method.

import re
from email.policy import EmailPolicy

class UnfoldingEncodedStringHeaderPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading white space from header lines
        # that separates apparent encoded-word tokens before further processing 
        # using somewhat crude CRLF-FWS-between-encoded-word matching
        value = re.sub(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)', '', value)
        return super().header_fetch_parse(name, value)

and use that as your policy when loading:

custom_policy = UnfoldingEncodedStringHeaderPolicy()

with open(argv[1], 'r', encoding='utf-8') as eml_file:
    msg = Parser(policy=custom_policy).parse(eml_file)

Demo:

>>> from io import StringIO
>>> from email.parser import Parser
>>> from email.policy import default as default_policy
>>> custom_policy = UnfoldingEncodedStringHeaderPolicy()
>>> Parser(policy=default_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业� �� <addr@addr.com>'
>>> Parser(policy=custom_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业部 <addr@addr.com>'

I filed Python issue #35547 to track this.

Encoded words are certainly completely permissible in `Subject:` and `From:`; this is what they were introduced for — tripleee, Dec 21 '18 at 05:46
Yup, and RFC 2047 explicitly named the phrases portion of From. I just wanted to cover all bases. I've removed that section now. — Martijn Pieters, Dec 21 '18 at 10:58
I have noticed that the current implementation also preserves the leading whitespace in the header values after unfolding. It only trims it from the first line. Is that expected? In other words, can the parsed header value start with a space? If yes, then such a behavior is correct but I was not able to find a definitive source that would confirm that. Some e-mail clients, e.g. Thunderbird and Evolution, seem to trim it. — Peter Bašista, Mar 12 '21 at 23:37
@PeterBašista email header parser is a veritable *quagmire* of complex rules and many, _many_ bad implementations out in the wild that complicate parsing. At this point I have no idea what is expected of leading whitespace in multi-line header values and lack the context and will to dig into this, sorry. — Martijn Pieters, Mar 13 '21 at 08:59
@MartijnPieters I understand, thank you. My context was that e-mails whose `Subject:` header contains only a single space after the colon in the first line and the remaining content on the next lines would be parsed by `email.parser` module with the default policy as having a subject which starts with a space. Anyway, maybe someone who is familiar with the related RFCs can explain what the correct behavior should be. — Peter Bašista, Mar 15 '21 at 07:27

Incorrect decoding in Python3 email module

1 Answers1