
I'm trying to figure out how to get only the text portion of an email message. Using the following code I'm able to get the body, but it is always followed by the html of the email, which I don't need. How can I tell my script to ignore the html?

import imaplib
import email

def extract_body(payload):
    if isinstance(payload,str):
        return payload
    else:
        return '\n'.join([extract_body(part.get_payload()) for part in payload])

conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
conn.login("username", "password")
conn.select()
typ, data = conn.search(None, 'UNSEEN')
try:
    for num in data[0].split():
        typ, msg_data = conn.fetch(num, '(RFC822)')
        for response_part in msg_data:
            if isinstance(response_part, tuple):
                msg = email.message_from_string(response_part[1])
                subject=msg['subject']
                print(subject)
                payload=msg.get_payload()
                body=extract_body(payload)
                print(body)
        typ, response = conn.store(num, '+FLAGS', r'(\Seen)')
finally:
    try:
        conn.close()
    except:
        pass
    conn.logout()
  • You'll have to use an HTML parser to extract it. Look at this question for a possible solution: http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text – Blender Oct 05 '12 at 00:25
  • @Blender Are you sure? I'm already getting the text-only part first. Isn't there a way to ignore the second part which is the message again, except with the html? – dixonticonderoga Oct 05 '12 at 00:45

1 Answer


You’re calling get_payload() on every part of the multipart container and joining the results, which is why the HTML part ends up appended to the plain-text part. Instead, iterate over the parts in the multipart container and keep only the one whose Content-Type is the one you’re looking for (text/plain in your case).
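A minimal sketch of that idea, assuming Python 3 and the standard-library email package. The helper name extract_text_body and the demo message are hypothetical, made up for illustration; the sketch also assumes that returning the first text/plain part is what you want and that utf-8 is an acceptable fallback charset.

```python
from email.message import EmailMessage, Message


def extract_text_body(msg: Message) -> str:
    """Return the first text/plain part of msg, or '' if there is none."""
    # walk() visits the message itself and every nested part, so this
    # works for both single-part and multipart messages.
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue  # skips text/html and the multipart containers
        if part.get_content_disposition() == "attachment":
            continue  # skips plain-text attachments
        # decode=True undoes base64 / quoted-printable and returns bytes.
        raw = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        return raw.decode(charset, errors="replace")
    return ""


# Hypothetical demo message with a plain-text and an HTML alternative.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("plain text version")
demo.add_alternative("<p>html version</p>", subtype="html")

print(extract_text_body(demo))
```

In the question's loop this would replace the get_payload()/extract_body pair with something like body = extract_text_body(msg). Note that on Python 3 imaplib returns bytes, so the message would need to be parsed with email.message_from_bytes rather than email.message_from_string.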

Josh Lee
  • OK, thanks. So it seems like `payload` has 2 parts, one with just text and one with html. How do I change this part to only give me the first part of payload: `return '\n'.join([extract_body(part.get_payload()) for part in payload])` – dixonticonderoga Oct 06 '12 at 01:07