How to parse HTML from eMail body - Python

Question

I'm trying to parse incoming emails in Python. The emails I receive are part plain text and part HTML. I want to extract the HTML part and find a table in it.

I tried using BeautifulSoup, but with the code below, bs only gets the first "" part and not the entire HTML part:

import imaplib
from bs4 import BeautifulSoup as bs

# connect to the Gmail IMAP server (user and pwd hold the account credentials)
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
# use m.list() to get all the mailboxes; "INBOX" selects only the inbox
m.select("INBOX")
resp, items = m.search(None, '(UNSEEN)') # you could filter using the IMAP rules here (check http://www.example-code.com/csharp/imap-search-critera.asp)
items = items[0].split() # get the message ids

for emailid in items:
    # fetch the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    soup = bs(text)

How can I use 'bs' for the entire HTML part? Or, is there any other way to parse out an html table from the email body?

'bs' seems to be the best fit for me, because I want to find a specific HTML table that contains a specific keyword, and a 'bs' search can retrieve the entire table and let me iterate over it.

Look at the text variable. If you're not giving BeautifulSoup the HTML string, then you can't expect sensible results. Garbage in, Garbage Out. — dilbert, Jul 15 '13 at 06:15
I understand that if I give BS part text and part HTML, it has a hard time parsing it, but my question is how to extract only the HTML part. I tried searching for the first html tag and cutting the string up to there. I tried extracting only the "text/html" part. In both cases it parsed only the first part of the HTML and not all of it. — skme, Jul 17 '13 at 08:56
Apparently, I was using the wrong parser. Once I switched to the 'lxml' parser, it worked just fine. — skme, Aug 11 '13 at 06:51
Perhaps you should post the solution as an answer to your own question as a reference for others in the future. — dilbert, Aug 11 '13 at 07:17

score 4 · Accepted Answer · answered Aug 11 '13 at 09:04

Apparently, I was using the wrong parser.

Once I switched to the 'lxml' parser, it worked just fine.

The following line needs to change:

soup = bs(text, "lxml")
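For readers who hit the other problem raised in the comments (isolating only the HTML part before parsing), here is a minimal sketch using the standard library's email package. It is not part of the original answer. The function name extract_html is made up for illustration, and the sketch assumes the whole message is fetched with '(RFC822)' rather than '(UID BODY[TEXT])' as in the question, because BODY[TEXT] omits the headers that tell the parser where each MIME part begins.

```python
import email
from email import policy
from typing import Optional


def extract_html(raw_message: bytes) -> Optional[str]:
    """Return the decoded text/html body of a raw message, or None.

    extract_html is a hypothetical helper, not a library function.
    raw_message is assumed to be the full message (headers and body),
    e.g. data[0][1] from m.fetch(emailid, '(RFC822)').
    """
    # policy.default makes the parser return an EmailMessage, which
    # provides get_body() and get_content().
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # Look for an HTML body part only; returns None if there is none.
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        return None
    # get_content() undoes the transfer encoding (quoted-printable or
    # base64) and decodes the bytes using the part's declared charset.
    return part.get_content()
```

The returned string contains only the HTML, so it can be handed to the parser as in the answer, for example soup = bs(html, "lxml"), after which a table search runs against the full markup rather than the raw MIME text.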

skme
