
In my MUA (Thunderbird 15.0.1) both mail subjects are displayed like this:

Keine Mail zu "Abschlagsänderung" gefunden

Here is a snippet to reproduce it:

import email

for subject in ['Subject: Re: Keine Mail zu "=?utf-8?q?Abschlags=C3=A4nderung?=" gefunden',
                'Subject: =?utf-8?q?Keine_Mail_zu_=22Abschlags=C3=A4nderung=22_gefunden?=']:
    msg = email.message_from_string(subject)
    print email.Header.decode_header(msg.get('subject'))

Output:

[('Re: Keine Mail zu "=?utf-8?q?Abschlags=C3=A4nderung?=" gefunden', None)]
[('Keine Mail zu "Abschlags\xc3\xa4nderung" gefunden', 'utf-8')]

Python cannot decode the first header, but Thunderbird can. It was created by KMail/1.11.4.

How can I parse the first header with umlauts in Python 2.7?

guettli
    Related: [email header decoding UTF-8](http://stackoverflow.com/questions/7331351/python-email-header-decoding-utf-8) – Ivan Chau Sep 13 '15 at 07:29

1 Answer


According to RFC 2047,

An 'encoded-word' MUST NOT appear within a 'quoted-string'.

A 'quoted-string' according to RFC 822 is

quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or quoted chars.

So I think the Python library is right, as

"=?utf-8?q?Abschlags=C3=A4nderung?="

is a quoted string. A better alternative with minimal quoting would be

=?utf-8?q?=22Abschlags=C3=A4nderung=22?=

having the " encoded as =22.

You could parse such headers by replacing each " with the encoded word =?utf-8?q?=22?=, separated from its neighbours by whitespace:

>>> email.Header.decode_header('=?utf-8?q?=22?= =?utf-8?q?Abschlags=C3=A4nderung?= =?utf-8?q?=22?=')
[('"Abschlags\xc3\xa4nderung"', 'utf-8')]
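The replacement can be wrapped in a small helper. This is only a minimal sketch: the function name `decode_quoted_encoded_words` and the regular expression are illustrative and not part of the `email` package. It assumes that the quotes directly enclose a single encoded word and that the word's charset is ASCII-compatible, so that `=22` decodes to a double quote. Whitespace between the decoded parts is normalized to single spaces, so the result is suited for display rather than for round-tripping.

```python
import re
from email.header import decode_header

# Hypothetical pattern: an encoded word directly enclosed in double quotes.
# Group 1 is the whole encoded word, group 2 its charset.
_QUOTED_WORD = re.compile(r'"(=\?([^?\s"]+)\?[bBqQ]\?[^?\s"]*\?=)"')


def decode_quoted_encoded_words(raw):
    """Decode a header value, tolerating "=?...?=" inside double quotes."""
    # Turn each quote into its own encoded word (=22 is '"'), reusing the
    # charset of the enclosed word so the decoder can merge all three.
    prepared = _QUOTED_WORD.sub(r'=?\2?q?=22?= \1 =?\2?q?=22?=', raw)
    parts = []
    for text, charset in decode_header(prepared):
        if isinstance(text, bytes):
            # Unencoded parts carry no charset; treat them as ASCII.
            text = text.decode(charset or 'ascii', 'replace')
        text = text.strip()
        if text:
            parts.append(text)
    # Parts are joined with single spaces (whitespace is not preserved).
    return ' '.join(parts)
```

Called with the KMail subject from the question, `decode_quoted_encoded_words('Re: Keine Mail zu "=?utf-8?q?Abschlags=C3=A4nderung?=" gefunden')` should return the subject with the umlaut decoded and the quotes kept. A header that is already valid passes through the rewrite unchanged, because the pattern only matches literal quotes around an encoded word.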
glglgl
  • Thank you very much for this answer. Since it is a bug in KMail, and this MUA is not very widespread, I will leave my code as it is. – guettli Oct 17 '12 at 19:10
  • I came across this bug in KMail again. The bug in KMail is still open and several years old: https://bugs.kde.org/show_bug.cgi?id=69007 – guettli May 14 '13 at 13:17