57

Is there a Python module that helps decode the various forms of encoded mail headers, mainly Subject, into simple (say, UTF-8) strings?

Here are example Subject headers from mail files that I have:

Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ]
  =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
  =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, like ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

> DEBUG:root:Subject is [('[ 201101251025 ]
> ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The first example decodes to a list of two tuples:

> >>> print decode_header("""[ 201105161048 ]
> GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")  
> [('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit',
> 'utf-8')]

Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all in one string?

> Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

does not decode well:

> print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")
>
>[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]
Michael M.
Hans Moser
  • 1
    I think `make_header(decode_header(subject))` is the simplest solution. See docs for make_header(): https://docs.python.org/2/library/email.header.html#email.header.make_header – guettli Jul 27 '16 at 08:58

9 Answers

63

This type of encoding is known as MIME encoded-word and the email module can decode it:

from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, each containing a decoded string and the encoding used. This is because the format supports different encodings in a single header. To merge these into a single string you need to convert them into a shared encoding and then concatenate them, which can be accomplished using Python's unicode object:

from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
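On Python 3 the same join needs a small adjustment, because `unicode` no longer exists and `decode_header` returns `bytes` chunks for headers that contain encoded-words but a plain `str` when nothing is encoded. A minimal sketch under those assumptions; the helper name `decode_mime_header` and its `default_charset` parameter are illustrative and not part of the `email` package:

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Join the (chunk, charset) pairs from decode_header into one str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Chunks without a declared charset fall back to default_charset.
            parts.append(chunk.decode(charset or default_charset, errors="replace"))
        else:
            # Headers without any encoded-word come back as str already.
            parts.append(chunk)
    return "".join(parts)


subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
print(decode_mime_header(subject))
```

The `errors="replace"` argument is a design choice for this sketch: a mislabeled chunk then produces replacement characters instead of raising an exception.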

Update 2:

The problem with this Subject line not decoding:

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^

is actually the sender's fault: the header violates the requirement that encoded-words in a header be separated by whitespace, specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing these corrupt headers with a regex that inserts a space after the encoded-word part (unless it's at the end), like so:

import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
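The pattern above is greedy and accepts any `=?...?=` span, so it can also split a valid quoted-printable word whose payload begins with `=`. A stricter variant, offered here as a sketch rather than a tested drop-in replacement, matches the `=?charset?encoding?payload?=` structure explicitly and only adds a space when the word runs directly into the following text. The names `ENCODED_WORD` and `add_space_after_encoded_words` are hypothetical:

```python
import re

# One RFC 2047 encoded-word (=?charset?B-or-Q?payload?=) that is directly
# followed by a non-whitespace character.
ENCODED_WORD = re.compile(r"(=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=)(?=\S)")


def add_space_after_encoded_words(header_value):
    """Insert a space after encoded-words that run into the following text."""
    return ENCODED_WORD.sub(r"\1 ", header_value)


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
print(add_space_after_encoded_words(broken))

# A well-formed quoted-printable word is left untouched.
valid = "=?utf-8?Q?=E5=9B=9E=E5=A4=8D=EF=BC=9A_=E6=94=B6=E4=BB=B6=E7=AE=B1=5F03?="
print(add_space_after_encoded_words(valid) == valid)
```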
Ingmar Hupp
  • The first example decodes to a list of two tuples: `>>> print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""") [('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]` Is this always [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or how do I get it all in one string? – Hans Moser Sep 07 '11 at 10:10
  • Yes, but you need to take into account that they are (or can be) using different encodings, so a bit of conversion is required. Updated answer with example. – Ingmar Hupp Sep 07 '11 at 10:48
  • Thanks for your answer. The wrong encoding of the encoded-word is really odd, because the mail header was created by Perl's MIME::Lite. – Hans Moser Sep 07 '11 at 11:36
  • The Update 2 regex breaks for a subject such as =?utf-8?Q?=E5=9B=9E=E5=A4=8D=EF=BC=9A_=E6=94=B6=E4=BB=B6=E7=AE=B1=5F03?= – jjyao Jan 15 '13 at 06:35
  • I think you should use `' '.join` instead of `''.join` to add the spacing between the words. – theomega Feb 19 '13 at 14:34
  • The python email library looks like a low level library to me. A bit like assembler ... I would like to avoid it, but don't know a better solution :-( -- don't get me wrong: Thank you for sharing your knowledge, it helped me to get things done faster. – guettli May 13 '16 at 06:57
  • 2
    @guettli: On Python 3, [`str(make_header(decode_header(subject)))`](http://stackoverflow.com/a/21715870/4279) works for all examples from the question (no need for `re.sub`, `''.join`) (it is 3 calls instead of 1 but it is not that bad). – jfs Jul 07 '16 at 11:15
  • @J.F.Sebastian unfortunately I will still need to support Python 2.7 – guettli Jul 11 '16 at 08:09
60

I was just testing with encoded headers in Python 3.3, and I found that this is a very convenient way to deal with them:

>>> from email.header import Header, decode_header, make_header

>>> subject = '[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?='
>>> h = make_header(decode_header(subject))
>>> str(h)
'[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'

As you can see it automatically adds whitespace around the encoded words.
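Wrapped as a helper, the same call chain can be applied to a parsed message. A minimal sketch, assuming Python 3 and the legacy default policy (`compat32`), under which `msg.get("Subject")` returns the raw, still-encoded header text. The function name `decode_subject` and the sample message are made up for illustration:

```python
import email
from email.header import decode_header, make_header


def decode_subject(msg):
    """Return the Subject of a parsed message as a plain str ('' if absent)."""
    raw = msg.get("Subject", "")
    # decode_header splits the value into (chunk, charset) pairs;
    # make_header reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(raw)))


raw_message = (
    "Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw_message)  # default policy is compat32
print(decode_subject(msg))
```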

It internally keeps the encoded and ASCII header parts separate as you can see when it re-encodes the non-ASCII parts:

>>> h.encode()
'[ 201105161048 ] GewSt: =?utf-8?q?_Wegfall_der_Vorl=C3=A4ufigkeit?='

If you want the whole header re-encoded you could convert the header to a string and then back into a header:

>>> h2 = Header(str(h))
>>> str(h2)
'[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'
>>> h2.encode()
'=?utf-8?q?=5B_201105161048_=5D_GewSt=3A__Wegfall_der_Vorl=C3=A4ufigkeit?='
Sander Steffann
  • 7
    This is the right answer, the documentation in [`email.header`](https://docs.python.org/3.3/library/email.header.html) doesn't make this clear, but `make_header(decode_header())` is how to properly decode email headers. – dimo414 Jul 23 '14 at 19:00
  • 6
    This technique appears to work for Python 2.7, too, but using `unicode()` instead of `str()` to convert the header object back to a (unicode) string. – Isaac Jan 19 '16 at 06:50
  • Thank you! Saved me 30 minutes of wondering through the official docs. :) – Vladimir Obrizan Aug 10 '23 at 11:59
6
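import email.header

def decode_header(value):
    # Python 2: decode each chunk with its charset (UTF-8 if none is given), then re-encode as UTF-8
    return ' '.join((item[0].decode(item[1] or 'utf-8').encode('utf-8') for item in email.header.decode_header(value)))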
Vitaly Greck
5

How about decoding headers in the following way:

import poplib, email

from email.header import decode_header, make_header

...

        subject, encoding = decode_header(message.get('subject'))[0]

        if encoding is None:
            print "\n%s (%s)\n"%(subject, encoding)
        else:
            print "\n%s (%s)\n"%(subject.decode(encoding), encoding)

This gets the subject from the email and decodes it with the specified encoding (or does no decoding if the encoding is None).

It worked for me with the encodings None, 'utf-8', 'koi8-r', 'cp1251' and 'windows-1251'.

Matt Fenwick
Eugene
  • why the [0] ? decode_header return two arguments, you want to put both in subject and encoding respectively, but [0] seems that you want to take just first argument – Hugo Ferreira Apr 01 '17 at 13:56
4

I had a similar issue, but my case was a little bit different:

  • Python 3.5 (the question is from 2011, but it still ranks very high on Google)
  • Read message directly from file as byte-string

A nice feature of the Python 3 email.parser is that all headers are automatically decoded to Unicode strings. However, this causes a little "misfortune" when dealing with malformed headers. The following header caused the problem:

Subject: Re: =?ISO-2022-JP?B?GyRCIVYlMyUiMnE1RCFXGyhC?=
 (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm) 
 =?ISO-2022-JP?B?GyRCJE4kKkNOJGkkOxsoQg==?=

This resulted in the following msg['subject']:

Re: 「コア会議」 (1/9(=?ISO-2022-JP?B?GyRCNmIbKEI=?=) 6:00pm-7:00pm)  のお知らせ
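For comparison, the automatic decoding mentioned above can be reproduced with a well-formed header. A minimal sketch, assuming a Python 3 version that ships `email.policy` and using `policy.default`; the sample message is made up for illustration:

```python
import email
from email import policy

raw = (
    b"Subject: Re: =?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=\n"
    b"\n"
    b"body\n"
)

# With policy.default, header access returns already-decoded text.
decoded = email.message_from_bytes(raw, policy=policy.default)
print(decoded["subject"])

# With the legacy compat32 policy (the default), the raw header text is kept.
legacy = email.message_from_bytes(raw)
print(legacy["subject"])
```

The second call shows one way to obtain the still-encoded value: parse the same bytes without passing a policy.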

Well, the issue is non-compliance with RFC 2047 (there should be linear-white-space after the MIME encoded-word), as already described in the answer by Ingmar Hupp. So my answer is inspired by his.

Solution 1: Fix the byte-string before actually parsing the email. This seemed to be the better solution, but I struggled to implement a regex substitution on byte-strings, so I opted for solution 2.

Solution 2: Fix the already parsed and partly decoded header value:

import email
import re
from email import policy
from email.header import decode_header


def fix_wrong_encoded_words_header(header_value):
    # insert a space after an encoded-word that runs into the following text
    fixed_header_value = re.sub(r"(=\?.*\?=)(?=\S)", r"\1 ", header_value)

    if fixed_header_value == header_value:  # nothing needed to fix
        return header_value
    else:
        dh = decode_header(fixed_header_value)
        default_charset = 'unicode-escape'
        correct_header_value = ''.join([str(t[0], t[1] or default_charset) for t in dh])
        return correct_header_value


with open(file, 'rb') as fp:  # read as byte-string
    msg = email.message_from_binary_file(fp, policy=policy.default)
    subject_fixed = fix_wrong_encoded_words_header(msg['subject'])

Explanation of important parts:

I modified Ingmar Hupp's regex to only replace malformed MIME encoded-words: (=\?.*\?=)(?=\S). I did this because re-processing every header would slow the parsing down heavily (I parse about 150,000 mails).

After applying the decode_header function to the fixed header value, we have the following parts in dh:

dh == [(b'Re: \\u300c\\u30b3\\u30a2\\u4f1a\\u8b70\\u300d (1/9(', None), 
       (b'\x1b$B6b\x1b(B', 'iso-2022-jp'), 
       (b' ) 6:00pm-7:00pm)  \\u306e\\u304a\\u77e5\\u3089\\u305b', None)]

To re-decode the unicode-escaped sequences, we set default_charset = 'unicode-escape' when building the new header-value.

The correct_header_value is now:

Re: 「コア会議」 (1/9(金 ) 6:00pm-7:00pm)  のお知らせ

I hope this will save somebody some time.

Addition: The answer by Sander Steffann didn't really help me, because I wasn't able to get the raw-value of the header-field out of the message-class.

Luke
2

This script works fine for me. I use it to decode all email subjects:

import base64
import quopri
import re

pat2 = re.compile(r'(([^=]*)=\?([^\?]*)\?([BbQq])\?([^\?]*)\?=([^=]*))', re.IGNORECASE)

def decodev2(a):
    data = pat2.findall(a)
    if not data:
        return a  # no encoded-words found, return the input unchanged
    line = []
    for raw, extra1, encoding, method, string, extra in data:
        # plain text in front of the encoded-word
        extra1 = extra1.replace('\r', '').replace('\n', '').strip()
        if len(extra1) > 0:
            line.append(extra1)
        if method.lower() == 'q':
            # header=True also decodes "_" as a space, as RFC 2047 specifies
            string = quopri.decodestring(string, header=True).strip()
        if method.lower() == 'b':
            string = base64.b64decode(string)
        line.append(string.decode(encoding, errors='ignore'))
        # plain text after the encoded-word
        extra = extra.replace('\r', '').replace('\n', '').strip()
        if len(extra) > 0:
            line.append(extra)
    return "".join(line)

samples:

=?iso-8859-1?q?una-al-dia_=2806/04/2017=29_Google_soluciona_102_vulnerabi?=
 =?iso-8859-1?q?lidades_en_Android?=

=?UTF-8?Q?Al=C3=A9grate?= : =?UTF-8?Q?=20La=20compra=20de=20tu=20vehi?= =?UTF-8?Q?culo=20en=20tan=20s=C3=B3lo=2024h?= =?UTF-8?Q?=2E=2E=2E=20=C2=A1Valoraci=C3=B3n=20=26?= =?UTF-8?Q?ago=20=C2=A0inmediato=21?=
Abhishek Gurjar
Jescolabcn
  • This works fine for some emails, but for instance `=?utf-8?Q?1=20underlig=20opfindelse=20til=20at=20forbr=C3=A6nde=20din=20v=C3=A6gt=20hurtigt=20og=20nemt=20...?=` is decoded to `b'1 underlig opfindelse til at forbr\xc3\xa6nde din v\xc3\xa6gt hurtigt og nemt ...'` and not `1 underlig opfindelse til at forbrænde din vægt hurtigt og nemt ...` – JoSSte Jun 16 '20 at 10:15
2
import email
from email.header import decode_header

# data is assumed to hold the raw message bytes at data[0][1] (e.g. from an IMAP fetch)
mail = email.message_from_bytes(data[0][1])

subject_list = decode_header(mail['subject'])

sub_list = []
for subject in subject_list:
    if subject[1]:
        # chunk with a declared charset
        subject = subject[0].decode(subject[1])
    elif isinstance(subject[0], bytes):
        # bytes chunk without a charset: fall back to UTF-8
        subject = subject[0].decode('utf-8')
    else:
        subject = subject[0]
    sub_list.append(subject)

subject = ''.join(sub_list)
print('Subject:' + subject)
Shmidt
-1

For me this worked perfectly (and it always gives me a string):

dmsgsubject, dmsgsubjectencoding = email.header.decode_header(msg['Subject'])[0]
msgsubject = dmsgsubject.decode(*([dmsgsubjectencoding] if dmsgsubjectencoding else [])) if isinstance(dmsgsubject, bytes) else dmsgsubject
ginger
-1

Python has an e-mail lib. http://docs.python.org/library/email.header.html

Take a look at email.header.decode_header()

Antony Woods