1

After 5 hours of trying, it's time to get some help. I sifted through all the Stack Overflow questions related to this but couldn't find the answer.

The code is a Gmail parser. It works for most emails, but some emails cause a UnicodeDecodeError. The problem is "raw_email.decode('utf-8')", but changing it (see comments) causes a different problem further down.

# Source: https://stackoverflow.com/questions/7314942/python-imaplib-to-get-gmail-inbox-subjects-titles-and-sender-name

import datetime
import time
import email
import imaplib
import mailbox
from vars import *
import re                   # to remove links from str
import string


EMAIL_ACCOUNT = 'gmail_login'
PASSWORD = 'gmail_psswd'

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(EMAIL_ACCOUNT, PASSWORD)
mail.list()
mail.select('inbox')
result, data = mail.uid('search', None, "ALL") # (ALL/UNSEEN)

id_list = data[0].split()
email_rev = reversed(id_list)             # Returns a list_reverseiterator, which is not a list
email_list = list(email_rev)
i = len(email_list)

todays_date = time.strftime("%m/%d/%Y")

for x in range(i):
    latest_email_uid = email_list[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]                                 # Returns bytes
    raw_email_str = raw_email.decode('utf-8')                    # Returns a str; this is the line that raises UnicodeDecodeError
    #raw_email_str = base64.b64decode(raw_email_str1)      # Tried this but it didn't work.
    #raw_email_str = raw_email.decode('utf-8', errors='ignore')  # Tried this, but it caused a TypeError further down where subject is created, because something there expects a str or bytes-like object
    email_message = email.message_from_string(raw_email_str)

    date_tuple = email.utils.parsedate_tz(email_message['Date'])           
    date_short = f'{date_tuple[1]}/{date_tuple[2]}/{date_tuple[0]}'

    # Header Details
    if date_short == '12/23/2019':
        #if date_tuple:
        #    local_date = datetime.datetime.fromtimestamp(email.utils.mktime_tz(date_tuple))
        #    local_message_date = "%s" %(str(local_date.strftime("%a, %d %b %Y %H:%M:%S")))
        email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
        subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
        #print(subject)
        if email_from.find('restaurants@uber.com') != -1:
            print('yay')

        # Body details
        if email_from.find('restaurants@uber.com') != -1 and subject.find('Payment Summary') != -1:
            for part in email_message.walk():
                if part.get_content_type() == "text/plain":
                    body = part.get_payload(decode=True)
                    body = body.decode("utf-8")             # Convert byte to str
                    body = body.replace("\r\n", " ")
                    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', body)           # removes url links
                    text2 = text.translate(str.maketrans('', '', string.punctuation))
                    body_list = re.sub(r"[^\w]", " ",  text2).split()

                    print(body_list)
                    print(date_short)

                else:
                    continue
tripleee
Ace Pash
  • To make your life easier you might want to have a look at https://imapclient.readthedocs.io/en/2.1.0/. This deals with most of the low-level stuff and is quite easy to use. Your code above is never going to work reliably unless you implement all the edge cases of the mail and IMAP RFCs (including different encodings on various mail message parts and such). – jerch Dec 27 '19 at 23:33
  • Thanks @jerch! Amazing resource and it works! However, it doesn't show how to extract the body of the email, which is what I want to parse. Did I miss it somewhere? – Ace Pash Dec 29 '19 at 01:34
  • Right, imapclient stops to help at the message itself (for a simple reason - a large attachment would penalize parsing - prolly unwanted). To reliably parse a raw mail message plz refer to stdlib modules like `email.message` and `email.parser` (https://docs.python.org/3/library/email.html). Sadly a mail message can be complicated (due to the parts logic with different encodings and mimetypes), you will have to work through the docs to cover those aspects. – jerch Dec 29 '19 at 11:48
  • Thank you @Jerch, would you please be able to elaborate a bit more about what the code would look like? I've tried various forms of smtplib, imaplib, imapclient, and email libraries without success. I got real close using imaplib but couldn't figure out the problem. Any guidance would help! – Ace Pash Dec 31 '19 at 09:30
  • Try to use high level lib: https://pypi.org/project/imap-tools/ All is already parsed. – Vladimir Sep 22 '20 at 08:52

3 Answers

1

Here is an example of how to retrieve and read mail parts with imapclient and the email.* modules from the Python standard library:

from imapclient import IMAPClient
import email
from email import policy


def walk_parts(part, level=0):
    print(' ' * 4 * level + part.get_content_type())
    # do something with part content (applies encoding by default)
    # part.get_content()
    if part.is_multipart():
        # recurse into each subpart (the original called an undefined get_parts)
        for subpart in part.get_payload():
            walk_parts(subpart, level + 1)


# context manager ensures the session is cleaned up
with IMAPClient(host="your_mail_host") as client:
    client.login('user', 'password')

    # select some folder
    client.select_folder('INBOX')

    # do something with folder, e.g. search & grab unseen mails
    messages = client.search('UNSEEN')
    for uid, message_data in client.fetch(messages, 'RFC822').items():
        email_message = email.message_from_bytes(
            message_data[b'RFC822'], policy=policy.default)
        print(uid, email_message.get('From'), email_message.get('Subject'))

    # alternatively search for specific mails
    msgs = client.search(['SUBJECT', 'some subject'])

    #
    # do something with a specific mail:
    #

    # fetch a single mail with UID 12345
    raw_mails = client.fetch([12345], 'RFC822')

    # parse the mail (very expensive for big mails with attachments!)
    mail = email.message_from_bytes(
        raw_mails[12345][b'RFC822'], policy=policy.default)

    # Now you have a python object representation of the mail and can dig
    # into it. Since a mail can be composed of several subparts we have
    # to walk the subparts.

    # walk all parts at once
    for part in mail.walk():
        # do something with that part
        print(part.get_content_type())
    # or recurse yourself into sub parts until you find the interesting part
    walk_parts(mail)

See the docs for email.message.EmailMessage. They cover everything you need to read the contents of a mail message.
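Since the question was about getting at the body text, here is a minimal sketch of that step on its own. It assumes a message parsed with policy.default, so that get_body() and get_content() are available. The RAW bytes and the plain_text_body helper are made up for illustration and stand in for whatever the IMAP fetch returned.

```python
import email
from email import policy

# Hypothetical raw message standing in for the bytes returned by the fetch.
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: Payment Summary\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Total: 12,50 \xa3\r\n"
    b"--BOUND\r\n"
    b'Content-Type: text/html; charset="us-ascii"\r\n'
    b"\r\n"
    b"<p>Total</p>\r\n"
    b"--BOUND--\r\n"
)


def plain_text_body(raw_bytes):
    """Return the text/plain body as str, or None if the mail has none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Ask only for a text/plain candidate; HTML-only mails give None.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return None
    # get_content() decodes using the charset declared on that part.
    return part.get_content()


body = plain_text_body(RAW)
html_only = plain_text_body(b"Content-Type: text/html\r\n\r\n<p>x</p>\r\n")
```

Note that the raw bytes are never decoded as a whole. Each part is decoded with its own declared charset, which is why a non-UTF-8 byte in one part does not break parsing.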

jerch
  • This could still fail for many real-world messages where the sender declared the wrong content-transfer-encoding. Historically, many clients declared `"us-ascii"` but then sent some undeclared 8-bit encoding anyway; these days, many probably claim `"utf-8"` but then actually use something else. – tripleee Dec 31 '19 at 15:39
  • True, but that's always the case - if something states to be XY but is Z, you have a bigger problem (which needs more involved recovery strategies and cannot be blueprinted this easily). – jerch Dec 31 '19 at 15:47
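One possible recovery strategy for the mislabeled-charset case raised in the comments is to try the declared charset first and then fall back to a short list of likely candidates. This is only a sketch: decode_with_fallback is a hypothetical helper, and the fallback order (utf-8, then cp1252, then latin-1) is an assumption that may not suit your mail.

```python
def decode_with_fallback(payload, declared_charset=None,
                         fallbacks=("utf-8", "cp1252")):
    """Decode bytes with the declared charset, then with each fallback."""
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.extend(fallbacks)
    for charset in candidates:
        try:
            return payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this cannot raise,
    # but the text may be garbled if the real encoding was different.
    return payload.decode("latin-1")


mislabeled = decode_with_fallback(b"caf\xe9", "utf-8")
ascii_claim = decode_with_fallback("café".encode("utf-8"), "us-ascii")
unknown_name = decode_with_fallback(b"abc", "x-no-such-charset")
```

In the question's loop this would take part.get_payload(decode=True) as the payload and part.get_content_charset() as the declared charset.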
0

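Use 'ISO-8859-1' instead of 'utf-8' when decoding. Every byte value is valid in ISO-8859-1, so the decode call cannot raise a UnicodeDecodeError, although text that was really sent in another encoding will come out garbled.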
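A small sketch of what that trade-off looks like in practice. The sample string is made up for illustration.

```python
# Every possible byte value decodes under ISO-8859-1 without an error.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("iso-8859-1")

# But bytes that were really UTF-8 come out as the wrong characters.
original = "Total: 12,50 €"
raw = original.encode("utf-8")
as_latin1 = raw.decode("iso-8859-1")   # no exception, but not the original text

# The decode is lossless, so the original bytes can still be recovered.
round_trip = as_latin1.encode("iso-8859-1")
recovered = round_trip.decode("utf-8")
```

So this silences the exception, but it only gives correct text when the mail really was ISO-8859-1.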

Ray
0

I had the same issue, and after a lot of research I realized that I simply needed to use the message_from_bytes function from email rather than message_from_string.

So for your code, simply replace:

 raw_email_str = raw_email.decode('utf-8')
 email_message = email.message_from_string(raw_email_str)

with

email_message = email.message_from_bytes(raw_email)

That should work like a charm :)
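To see the difference without a mailbox, here is a minimal sketch using a made-up raw message (RAW is hypothetical) whose body is declared as iso-8859-1 and contains a byte that is not valid UTF-8.

```python
import email

# Hypothetical message: the 0xA3 byte is a pound sign in ISO-8859-1
# and is not valid UTF-8 on its own.
RAW = (
    b"From: restaurants@uber.com\r\n"
    b"Subject: Payment Summary\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Total: 12,50 \xa3\r\n"
)

# Decoding the whole message as UTF-8 fails, as in the question.
try:
    RAW.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Parsing the bytes directly works; the body is decoded afterwards
# using the charset declared on the part.
msg = email.message_from_bytes(RAW)
charset = msg.get_content_charset() or "utf-8"
body = msg.get_payload(decode=True).decode(charset)
```

The body decode on the last line can still fail if the sender declared the wrong charset, so it is worth guarding that line in real code.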

kshitij Nigam