I have thousands of emails stored in either plain text or HTML. All of the plain text emails are formatted pretty much the same, so extracting just the actual email message has been simple.
But the HTML emails are all over the place, and I'm finding it difficult to come up with a method of extracting only the body of the message. There's a lot of other junk in each email that I don't want, such as "This email was generated by..." and a bunch of other non-user-generated text that changes from email to email.
Is there some way for Python to identify what resembles a body of text or complete sentences?
I've already tried the regular expressions from this question: "a Regex for extracting sentence from a paragraph in python".
But the problem with that was that I have a lot of lines that look like this:
Title* : Mr.
The regular expression treats this as a sentence, but I don't want it extracted.
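To illustrate the false positive, here is a minimal sketch. The pattern below is a simplified stand-in that I wrote for this example, not the exact regex from the linked question, and the second sample line is made up; it only shows why a form-field line can pass a naive sentence match.

```python
import re

# Simplified stand-in for a sentence-matching pattern (an assumption for
# illustration, NOT the regex from the linked question): a capital letter,
# then any run of characters up to the first terminal punctuation mark.
SENTENCE_RE = re.compile(r"[A-Z][^.!?]*[.!?]")

lines = [
    "Title* : Mr.",                                   # form field, not wanted
    "Please send me the updated invoice by Friday.",  # made-up real sentence
]

for line in lines:
    # Both lines produce a match, so the pattern alone cannot tell them apart.
    print(repr(line), "->", SENTENCE_RE.findall(line))
```

Both lines match, because "Title* : Mr." starts with a capital letter and ends with a period just like a real sentence does.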
I've also tried combining that regular expression with NLTK's POS tagger to print only sentences that contain both a noun and a verb, but it doesn't seem to work too well, as it's just the built-in POS tagger and not trained on any dataset.
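The noun-and-verb check is roughly the following sketch. The helper name `has_noun_and_verb` is made up for this example, and the tags are hand-written rather than real tagger output; in my actual code they come from `nltk.pos_tag`, which uses Penn Treebank tags by default.

```python
def has_noun_and_verb(tagged_tokens):
    """Return True if the tagged tokens contain at least one noun and one verb.

    Assumes Penn Treebank tags: noun tags start with "NN", verb tags with "VB".
    """
    has_noun = any(tag.startswith("NN") for _, tag in tagged_tokens)
    has_verb = any(tag.startswith("VB") for _, tag in tagged_tokens)
    return has_noun and has_verb


# Hand-written (word, tag) pairs for illustration only; a real tagger may
# assign different tags, which is exactly where the filter goes wrong.
form_field = [("Title", "NN"), ("*", "SYM"), (":", ":"), ("Mr.", "NNP")]
real_sentence = [("Please", "UH"), ("send", "VB"), ("the", "DT"),
                 ("invoice", "NN"), (".", ".")]

print(has_noun_and_verb(form_field))     # False: nouns but no verb
print(has_noun_and_verb(real_sentence))  # True: has both
```

With ideal tags this filters out the form field, but in practice the tagger's output on my emails is not reliable enough for the check to be useful.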
So I guess my question is: how can I fix my problem? Am I missing something?