I have thousands of emails stored in either plain text or HTML. All of the plain text emails are formatted pretty much the same, so extracting just the actual email message has been simple.

But the HTML emails are all over the place, and I'm finding it difficult to come up with a method of extracting only the body of the message. There's a lot of other junk in the emails that I don't want, such as "This email was generated by..." and a bunch of other non-user-generated text that changes from email to email.

Is there some way for Python to identify what resembles a body of text or complete sentences?

I've already tried using regular expressions found here: a Regex for extracting sentence from a paragraph in python

But the problem with that was that I have a lot of lines that look like this:

Title* : Mr.

The regular expression treats this as a sentence, but I don't want it extracted.

I've also tried combining that regular expression with NLTK's POS tagger to print only sentences that contain both a noun and a verb, but it doesn't seem to work too well, as it's just the built-in POS tagger and not trained on any dataset.

So I guess my question is: how can I fix my problem? Am I missing something?

yannikrock
    Are you building a ham/spam classifier? You can consider non-sentences as spam and sentences as ham. – alvas Jun 25 '13 at 09:58

3 Answers

I expect that all the sentences you need are in HTML paragraphs, i.e. surrounded by <p></p> tags. You could use a regular expression to extract those first and then process them.
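
As a rough sketch of that idea, the snippet below pulls the contents of each paragraph element out of an HTML string with the standard `re` module. The names `P_BLOCK`, `TAG`, and `paragraphs` are made up for illustration, and the pattern assumes the paragraphs are properly closed and not nested, which messy email HTML will not always satisfy.

```python
import re
from html import unescape

# Matches <p ...> ... </p> in any letter case, across line breaks.
# The \b keeps it from matching other tags such as <pre>.
P_BLOCK = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
# Matches any remaining inline tag inside a paragraph, e.g. <b> or <br>.
TAG = re.compile(r"<[^>]+>")


def paragraphs(html):
    """Return the text of each <p> element, with inline tags removed."""
    result = []
    for inner in P_BLOCK.findall(html):
        # Replace inline tags with spaces, then decode entities like &amp;.
        text = unescape(TAG.sub(" ", inner))
        # Collapse runs of whitespace, including newlines, to single spaces.
        text = " ".join(text.split())
        if text:
            result.append(text)
    return result
```

Anything outside a paragraph element, such as form-style lines held in a `<div>`, is simply skipped, so the list that comes back is what you would hand to the sentence-level processing.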

Steve Barnes
    Unfortunately the HTML is so inconsistent that some of them have <p> surrounding either all the text or just a small snippet. But through further examination of the emails I've found the <p> tags method to be true enough of the time for this to help. Thank you! – yannikrock Aug 06 '13 at 07:36

You could use BeautifulSoup to parse the HTML of the email, and then apply your regular expressions to the extracted text.
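
Something along these lines might be a starting point. It is only a sketch, assuming the `beautifulsoup4` package is installed and using Python's built-in `html.parser` backend; the function name `visible_text_blocks` is invented for this example.

```python
from bs4 import BeautifulSoup


def visible_text_blocks(html):
    """Return the non-empty lines of text in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never shown to the reader.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Put a newline between adjacent text nodes so blocks stay separate.
    text = soup.get_text(separator="\n")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Each returned line can then go through whatever regex or sentence filter you settle on. Note that this keeps form-style lines such as "Title* : Mr." as well, so it only solves the tag-stripping half of the problem.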

mbatchkarov
Ixl123

Refer to nltk.tokenize.sent_tokenize(text) in NLTK's tokenizer package. Note that you'll have to try it out for yourself, on your target text. When tokenizing text into sentences, there are always some odd cases where one sentence tokenizer or another produces wrong output.
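
For instance, you might pair `sent_tokenize` with a small hand-written filter so that lines like "Title* : Mr." are dropped. The sketch below is an assumption-laden illustration, not a tested recipe for your data: `LABEL_LINE`, `looks_like_sentence`, `extract_sentences`, and the `min_words` threshold are all invented here, each input line is treated as its own paragraph, and `sent_tokenize` needs NLTK's Punkt data to be downloaded first.

```python
import re

# A short label, an optional asterisk, a colon, and a value of at most
# two words, e.g. "Title* : Mr." (hypothetical pattern, tune to your data).
LABEL_LINE = re.compile(r"^[\w ]{1,30}\*?\s*:\s*\S+(\s+\S+)?$")


def looks_like_sentence(candidate, min_words=4):
    """Heuristic check: enough words, sentence-final punctuation, not a form field."""
    text = candidate.strip()
    if LABEL_LINE.match(text):
        return False
    if len(text.split()) < min_words:
        return False
    return text.endswith((".", "!", "?"))


def extract_sentences(text):
    """Split each line into sentences with NLTK and keep the plausible ones."""
    # Imported here so the filter above can be used without NLTK installed.
    from nltk.tokenize import sent_tokenize

    kept = []
    for line in text.splitlines():
        for sentence in sent_tokenize(line):
            if looks_like_sentence(sentence):
                kept.append(sentence)
    return kept
```

The thresholds are guesses, so check them against a sample of your own emails before trusting the output.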

prash