
I am sending an email with an EmailMessage object to a Gmail inbox.
The subject of the email looks something like this: u"You got a letter from Daėrius ęėįęėįęįėęįę---reply3_433441"

When I receive the email and look at the message info, I can see that the Subject line looks like this:

Subject: =?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?= =?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?=

How do I decode this subject line?

I have successfully decoded the email body (text/plain) with this:

for part in msg.walk():
  if part.get_content_type() == 'text/plain':
    msg_encoding = part.get_content_charset()
    # Python 2: undo the quoted-printable transfer encoding
    msg_text = part.get_payload().decode('quoted-printable')
    # smart_unicode is presumably Django's helper from django.utils.encoding
    msg_text = smart_unicode(msg_text, encoding=msg_encoding, strings_only=False, errors='strict')
Darius

3 Answers


See RFC 2047 for a complete description of the format of internationalized email headers. The basic format is "=?" charset "?" encoding "?" encoded-text "?=". So in your case, you have a base-64 encoded UTF-8 string.
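To make the format concrete, here is a minimal sketch that takes one encoded-word apart by hand. It is written for Python 3, and the helper name decode_encoded_word and the regular expression are illustrative assumptions rather than part of any library; in practice the standard library's email.header module does this for you.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r'=\?([^?]+)\?([bBqQ])\?([^?]*)\?=')

def decode_encoded_word(word):
    """Decode a single encoded-word (hypothetical helper, for illustration only)."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError('not an RFC 2047 encoded-word: %r' % word)
    charset, encoding, text = match.groups()
    if encoding.lower() == 'b':
        # "B" encoding is plain base64
        raw = base64.b64decode(text)
    else:
        # "Q" encoding resembles quoted-printable, with "_" standing for a space
        raw = quopri.decodestring(text.encode('ascii'), header=True)
    # The charset field says how to turn the raw bytes into text
    return raw.decode(charset)

# Second encoded-word from the Subject line in the question
word = '=?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?='
decoded = decode_encoded_word(word)
```

Here decoded holds the tail of the original subject, ending in ---reply3_433441.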

You can use the email.header.decode_header and str.decode functions to decode it and get a proper Unicode string (the session below is Python 2):

>>> import email.header
>>> x = email.header.decode_header('=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?=')
>>> x
[('You got a letter from Da\xc4\x97rius \xc4\x99\xc4\x97\xc4\xaf\xc4\x99\xc4\x97\xc4\xaf\xc4\x99', 'utf-8')]
>>> x[0][0].decode(x[0][1])
u'You got a letter from Da\u0117rius \u0119\u0117\u012f\u0119\u0117\u012f\u0119'
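For the complete two-part Subject line from the question, a sketch along the following lines should work on Python 3, where decode_header returns bytes for encoded parts. The function name decode_subject is made up for this example, and falling back to ASCII for parts without a charset is an assumption.

```python
from email.header import decode_header

def decode_subject(raw_subject):
    """Join every decoded part of a header into one string (hypothetical helper)."""
    parts = []
    for text, charset in decode_header(raw_subject):
        if isinstance(text, bytes):
            # Encoded parts arrive as bytes plus the charset named in the header;
            # unencoded parts may have charset None, so fall back to ASCII (assumption)
            parts.append(text.decode(charset or 'ascii', errors='replace'))
        else:
            # A header without any encoded-words comes back as str already
            parts.append(text)
    return ''.join(parts)

# The full Subject value from the question: two encoded-words separated by a space
raw = ('=?utf-8?b?WW91IGdvdCBhIGxldHRlciBmcm9tIERhxJdyaXVzIMSZxJfEr8SZxJfEr8SZ?= '
       '=?utf-8?b?xK/El8SZxK/EmS0tLXJlcGx5M180MzM0NDE=?=')
subject = decode_subject(raw)
```

Whitespace between adjacent encoded-words is only a separator and is dropped, so subject comes out as the original single string without any manual splitting.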
Adam Rosenfield
  • Thanks a lot. Works like a charm. I just had to manually split the Subject line into several parts, as each part had to be decoded separately. Finished my task in less than 15 minutes with your example. – Darius Mar 18 '11 at 06:10

You should look at the email.header module in the Python standard library. In particular, at the end of the documentation, there's a decode_header() function you can use to do most of the hard work for you.

André Caron

The subject line is UTF-8, but you're reading it as ASCII. You're safest reading it all as UTF-8, since ASCII is effectively a subset of UTF-8.

theheadofabroom