Python: What is this encoding and how to decode?

Question

I have a lot of strings from mail bodies that print like this:

=C3=A9

This should be 'é' for example.

What exactly is this encoding and how to decode it?

I'm using Python 3.5.

EDIT:

I managed to get the body of the mail properly encoded by applying:

quopri.decodestring(sometext).decode('utf-8')

However, I still struggle to decode the From, To, Subject, etc. headers correctly.

This is how I construct the e-mails:

import imaplib
import email
import quopri


mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('mail@gmail.com', '*******')
mail.list()

mail.select('"[Gmail]/All Mail"') 



typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))

data[0].split()

print(data[0].split())

for e_mail in data[0].split():
    typ, data = mail.fetch('{}'.format(e_mail.decode()),'(RFC822)')
    raw_mail = data[0][1]
    email_message = email.message_from_bytes(raw_mail)
    if email_message.is_multipart():
        for part in email_message.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                to = email_message['To']

                utf = quopri.decodestring(to)

                text = utf.decode('utf-8')
                print(text)
.
.
.

I still got this: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=

Try, for example, `'é'.encode().decode("utf-8")` — Srce Cde, Nov 26 '18 at 17:00

score 3 · Answer 1 · answered Nov 27 '18 at 01:51

That's called "quoted-printable" encoding. It was defined by RFC 1521, which has since been superseded by RFC 2045. Its purpose is to replace unusual byte values with a sequence of normal, safe ASCII characters so that the message can be handled safely by the email system.

In fact there are two levels of encoding here. First the letter 'é' was encoded into UTF-8, which produces the bytes '\xc3\xa9', and then those UTF-8 bytes were encoded into the quoted-printable form '=C3=A9'.
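
To make the two levels concrete, here is a minimal sketch that goes in the encoding direction, from the letter to the quoted-printable form. The variable names are only illustrative:

```python
import quopri

# Level 1: text -> bytes, using UTF-8 as the character encoding
utf8_bytes = 'é'.encode('utf-8')

# Level 2: bytes -> quoted-printable, so only safe ASCII characters remain
qp_bytes = quopri.encodestring(utf8_bytes)

print(utf8_bytes)
print(qp_bytes)
```

Decoding simply reverses the two steps in the opposite order.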

You can undo the quoted-printable step by using the decode or decodestring function of the quopri module, documented at https://docs.python.org/3/library/quopri.html. That will look something like:

    import quopri

    source = '=C3=A9'
    print(quopri.decodestring(source))

That will undo the quoted-printable encoding and show you the UTF-8 bytes b'\xc3\xa9'. To get back to the letter 'é' you need to call the decode method of that bytes object and tell Python that the bytes contain UTF-8 encoded text, something like:

    utf = quopri.decodestring(source)
    text = utf.decode('utf-8')
    print(text)

UTF-8 is only one of many possible ways of encoding letters into bytes. For example, if your 'é' had been encoded as ISO-8859-1 it would have had the byte value '\xe9' and its quoted-printable representation would have been '=E9'.
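
As a small sketch of why the character encoding matters, the same quoted-printable step applied to '=E9' yields a byte that decodes as ISO-8859-1 but is not valid UTF-8. The variable names here are illustrative only:

```python
import quopri

# Undo the quoted-printable layer; this gives the single byte 0xE9
raw = quopri.decodestring('=E9')

# Interpreting that byte as ISO-8859-1 gives back the letter
latin1_text = raw.decode('iso-8859-1')
print(latin1_text)

# Interpreting the same byte as UTF-8 fails, because 0xE9 alone
# is an incomplete UTF-8 sequence
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print('not valid UTF-8:', exc)
```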

When you're dealing with email, you should see a Content-Type header that tells you what type of content is being sent and which letter-to-bytes encoding was applied to the text of the message (or to an individual MIME part, in a multipart message). If that text was then encoded again by applying the quoted-printable encoding, that additional step should be indicated by a Content-Transfer-Encoding header. So your message with UTF-8 encoded text carried in quoted-printable format should have had headers that look like this:

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
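
If the message has already been parsed with the email package, those two headers can be used instead of hard-coding the decoding steps. The following is a sketch under the assumption that the part is text and declares a charset Python knows; raw_mail is a hand-written sample message, not real data:

```python
import email

# Hand-written sample message (an assumption for illustration)
raw_mail = (
    b'Content-Type: text/plain; charset="utf-8"\n'
    b'Content-Transfer-Encoding: quoted-printable\n'
    b'\n'
    b'=C3=A9\n'
)

msg = email.message_from_bytes(raw_mail)

# decode=True undoes the Content-Transfer-Encoding (quoted-printable
# or base64) and returns bytes
payload = msg.get_payload(decode=True)

# The charset comes from the Content-Type header; falling back to
# UTF-8 when none is declared is an assumption of this sketch
charset = msg.get_content_charset() or 'utf-8'
body = payload.decode(charset)
print(body.strip())
```

In a multipart message the same two calls can be made on each part returned by walk().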

Thank you ottomeister, you helped a lot. However I got stuck again. I added my question in the description above. I would appreciate any help a lot!!! — Peter Petocz, Nov 27 '18 at 14:25
The format used in email headers is called a "MIME encoded-word". It is defined by RFC 2047. The answers to https://stackoverflow.com/questions/7331351/python-email-header-decoding-utf-8 explain how to use the Python `email.header` module to decode that kind of header. — ottomeister, Nov 28 '18 at 21:13
Thank you again. I found out yesterday and wrote a little function to decode header items. — Peter Petocz, Nov 28 '18 at 21:37

score 2 · Accepted Answer · answered Nov 28 '18 at 21:39

This solved it:

from email.header import decode_header

def mail_header_decoder(header):
    """Decode a MIME encoded-word header (RFC 2047) into a str."""
    if header is None:
        return None
    parts = []
    for value, charset in decode_header(header):
        if isinstance(value, bytes):
            # Encoded parts come back as bytes plus their charset;
            # unencoded fragments between them have charset None
            parts.append(value.decode(charset or 'ascii', errors='replace'))
        else:
            # A header without any encoded words is returned as str
            parts.append(value)
    return ''.join(parts)  # convert list to string
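
As an alternative sketch, the standard library's make_header can reassemble the decoded parts, which avoids handling the charsets by hand. The sample value is the header shown in the question; the variable names are illustrative:

```python
from email.header import decode_header, make_header

# Encoded-word header value taken from the question
raw_header = '=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?='

# decode_header splits the value into (bytes_or_str, charset) pairs;
# make_header builds a Header object from them, and str() renders it
decoded = str(make_header(decode_header(raw_header)))
print(decoded)
```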
