imaplib with Python 3.7.4 occasionally returns an attachment that fails to be decoded

Question

Some background:

imaplib with Python 3.7.4 occasionally returns a photo attachment (jpg) that fails to decode after being downloaded from the server. I've confirmed, across multiple emails, that the photos are base64-encoded when sent. Most photos work; certain ones don't, and I can't tell why. At this time, I don't know which email client was used to send the particular email that causes the crash, or the source of the photo (phone, camera, PC, etc.). I've tested every file type supported by python-pillow without any issues. It's just this one photo/email. Lastly, if I remove the attachment there are no issues, so it has something to do with the photo. All Python packages are at their current versions.

The commented-out lines in the code below show what I've tried. I tried the following encodings:

utf-8 (which fails to decode it at all)

Traceback (most recent call last):
  File "(file path)", line 514, in DoEmail
    raw_email_string = raw_email.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 10922: invalid start byte

cp1252 (which returns a NoneType object when trying to save the file)

Traceback (most recent call last):

part.get_payload(decode=True))

TypeError: a bytes-like object is required, not 'NoneType'
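For reference, here is a minimal sketch of both failure modes on a hand-built message, not the real email. The byte string `raw`, the boundary `XYZ`, and the fake JPEG bytes are all made up for illustration. It shows that a single 8-bit byte such as 0x92 in a text part is enough to break `raw.decode('utf-8')`, while `email.message_from_bytes` still parses the message and decodes the base64 attachment. It also shows that `get_payload(decode=True)` returns `None` for any part whose `is_multipart()` is true (per the `email` docs, that includes `message/rfc822` parts, whose main type is `message` rather than `multipart`). Whether either applies to the failing email is an assumption that would need checking against the actual message.

```python
import base64
import email

# Made-up stand-in for a JPEG file; the content is arbitrary binary data.
jpeg_bytes = b"\xff\xd8\xff\xe0" + bytes(range(256))
b64 = base64.encodebytes(jpeg_bytes).replace(b"\n", b"\r\n")

# Hand-built message: an 8-bit text part (0x92 is a curly apostrophe in
# windows-1252) followed by a base64-encoded attachment.
raw = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Here\x92s the photo\r\n"
    b"--XYZ\r\n"
    b'Content-Type: image/jpeg; name="photo.jpg"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="photo.jpg"\r\n'
    b"\r\n" + b64 + b"--XYZ--\r\n"
)

# Decoding the whole message as text fails on the 0x92 byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Parsing the bytes directly works; each part is decoded on its own terms.
msg = email.message_from_bytes(raw)
container_payload = msg.get_payload(decode=True)  # None: msg is a container
text_part, image_part = msg.get_payload()
decoded_text = text_part.get_payload(decode=True).decode(
    text_part.get_content_charset()
)
decoded_image = image_part.get_payload(decode=True)

print(utf8_ok, container_payload, decoded_image == jpeg_bytes)
```

The text charset is applied only to the text part, after the transfer encoding has been undone; the image bytes never pass through a text codec.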

I've looked at the email.parser source, the email.parser docs, and the imaplib docs, as well as a good example by MattH and an attachment example by John Paul Hayes.

My Question:

Why do certain photos, even though they seem to be encoded correctly, cause it to crash? And how do I fix it? Is there a better method to get and save the attachments?

Relevant Code:

# Site is the email server address
# Port is the email server port, usually 993.
mail = imaplib.IMAP4_SSL(host=Site, port=Port) # imaplib module implements connection based on IMAPv4 protocol
mail.login(Email, password)
mail.select('inbox', readonly=False) # Connected to inbox.
# SearchPhrase is the Phrase used when finding unique emails.
result, data = mail.uid('SEARCH', None, f'Subject "{SearchPhrase}"') # search and return uids instead 
if result == 'OK':
    EmailIdList = data[0].split() # data[0] is a space-separated byte string of the ids
    count = len(EmailIdList)
    for x in range(count): 
        if GUI: GUI.resultStatus = resx.currentProgress(x+1, count)
        latest_email_uid = EmailIdList[x] # unique ids wrt label selected
        EmailID = latest_email_uid.decode('utf-8')
        result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
        if result == 'OK':

            raw_email = email_data[0][1]

#                try:
#                    raw_email_string = raw_email.decode('utf-8')
#                except:
#                    raw_email_string = raw_email.decode('cp1252')
#                email_message = email.message_from_string(raw_email)

            email_message = email.message_from_bytes(raw_email)
            print(email_message)
            dt = parse(email_message['Date']) #dateutil.parser.parse()
            day = str(dt.strftime("%B %d, %Y")) #date())
#            msg.get_content_charset(), 'ignore').encode('utf8', 'replace')  # incomplete fragment, not valid Python on its own

            # this will loop through all the available multiparts in email
            for part in email_message.walk():
                charset = part.get_content_charset()
                if part.get_content_maintype() != 'multipart' and part.get('Content-Disposition') is not None:
                    fileName = part.get_filename()
                    if fileName:  # get_filename() returns None when no filename is set
                        fileName = fileName.replace('\n','').replace('\r','')
                        print(fileName)
                        with open(fileName, 'wb') as f: 
                            ########  ---- HERE ---- ##########
                            f.write(part.get_payload(decode=True))
                elif part.get_content_type() == "text/plain": # get only text/plain 
                    body = str(part.get_payload(decode=True), str(charset), "ignore").replace('\r','')
                    print(body)

                elif part.get_content_type() == "text/html": # get only html
                    html = str(part.get_payload(decode=True), str(charset), "ignore").replace('\n', '').replace('\r', ' ')
                    print(html)
                else:
                    continue
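One possible way to make the attachment-saving step more defensive is sketched below. It assumes the crash comes from `get_payload(decode=True)` returning `None`, which is not confirmed for the failing email. The function name `save_attachments` and the `out_dir` parameter are hypothetical, introduced only for this example. The sketch skips container parts, checks the filename before calling string methods on it, and skips any part whose decoded payload is `None` instead of passing it to `f.write`.

```python
import email
import os


def save_attachments(raw_email, out_dir):
    """Write each decodable attachment in raw_email (bytes) to out_dir.

    Returns the list of file paths written. Hypothetical helper for
    illustration; it is not part of imaplib or the email package.
    """
    saved = []
    message = email.message_from_bytes(raw_email)
    for part in message.walk():
        # Containers (multipart/*, message/rfc822) carry no payload of their
        # own; walk() visits their children separately.
        if part.is_multipart():
            continue
        if part.get_content_disposition() is None:
            continue
        file_name = part.get_filename()
        if not file_name:
            continue
        # Remove line breaks left by folded headers and any directory parts.
        file_name = file_name.replace("\r", "").replace("\n", "")
        file_name = os.path.basename(file_name)
        payload = part.get_payload(decode=True)
        if payload is None:
            # Nothing to write; skip rather than crash in f.write().
            continue
        path = os.path.join(out_dir, file_name)
        with open(path, "wb") as f:
            f.write(payload)
        saved.append(path)
    return saved
```

In the loop above this would be called as `save_attachments(raw_email, some_directory)` right after the fetch. Logging the skipped parts, rather than silently continuing, would help identify which part of the problem email has no decodable payload.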

Edit: I believe these are the MIME Headers for the image in question.

------=_NextPart_000_14A6_01D55B4C.3FE8C840
Content-Type: image/jpeg;
    name="8~a~0ff68d6a-12aa-49bf-9908-0b28ecd7ec83~634676194557918023.jpg"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
    filename="8~a~0ff68d6a-12aa-49bf-9908-0b28ecd7ec83~634676194557918023.jpg"
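To confirm which headers belong to which part, rather than reading them out of the printed message, something like the following could be used. This is a sketch; `describe_parts` is a hypothetical helper name. It lists, for every MIME part, the content type, the transfer encoding, the filename, and whether the part is a container. A part that reports `True` in the last column will return `None` from `get_payload(decode=True)`.

```python
import email


def describe_parts(raw_email):
    """Return one tuple per MIME part of raw_email (bytes):
    (content type, Content-Transfer-Encoding, filename, is_container).
    Hypothetical helper for illustration only.
    """
    message = email.message_from_bytes(raw_email)
    rows = []
    for part in message.walk():
        rows.append((
            part.get_content_type(),
            part.get("Content-Transfer-Encoding"),
            part.get_filename(),
            part.is_multipart(),
        ))
    return rows
```

Running `for row in describe_parts(raw_email): print(row)` on the problem email would show whether the part carrying the jpg filename is really an `image/jpeg` leaf with `base64` encoding, or something else.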

Edit: The location of the crash (where it decodes the base64 data to save the file) is marked in the code by: ######## ---- HERE ---- ##########

Are you aware that UTF-8 and CP-1252 are text encodings? It makes no sense to use them in the context of a jpg file. — lenz, Aug 27 '19 at 06:54
@lenz Yes, I'm aware; however, this is the format the emails are received in. If you know another way to decode and/or parse it, I'm all ears. Also nearly all emails with attached photos work. — Jakar510, Aug 27 '19 at 17:05
It's odd you would see an 0x92 anywhere in a normally encoded email. What are MIME headers of the section of an image that fails to decode. — Max, Aug 27 '19 at 17:20
message_from_bytes is the proper function to call, as you're using. message_from_string is basically legacy. — Max, Aug 27 '19 at 17:20
@Max forgive my ignorance, but what are "MIME headers"? And Ok, good to know for message_from_bytes. — Jakar510, Aug 27 '19 at 17:22
thanks for `replace('\n','').replace('\r','')`. now filenames are showing correct. — Akhil, Jan 06 '22 at 04:29
