
I have some emails saved in plain-text (.txt) format that have been forwarded multiple times.

I want to extract the main body of the original mail. It should be at the last position in the hierarchy, right? (Please point it out if I'm wrong.)

The email module doesn't give me a way to extract the content. If I make a message object, the object doesn't have a field for the content of the body.

Any idea how to do it? Is there a module for this, or any particular approach you can think of, other than the most naive one of starting from the end of the text file and scanning backwards until you find a header?

If there is an easy or straightforward way or module in any other language (which I doubt), please let me know that as well!

Any help is much appreciated!

2 Answers


The email module doesn't give me a way to extract the content. If I make a message object, the object doesn't have a field for the content of the body.

Of course it does. Have a look at the Python documentation and examples. In particular, look at the walk and get_payload methods.
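As a rough illustration of the walk and get_payload approach, here is a minimal sketch that collects every text/plain part of a parsed message. The function name plain_text_parts is made up for this example, and the fallback to UTF-8 when a part declares no charset is an assumption, not something the email module prescribes.

```python
from email import message_from_string


def plain_text_parts(raw_text):
    """Return the decoded text/plain bodies found in raw_text, in document order."""
    msg = message_from_string(raw_text)
    bodies = []
    for part in msg.walk():
        # Multipart containers only hold other parts, so skip them.
        if part.is_multipart():
            continue
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64 / quoted-printable and returns bytes.
            payload = part.get_payload(decode=True)
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = part.get_content_charset() or "utf-8"
            bodies.append(payload.decode(charset, errors="replace"))
    return bodies
```

If the forwards were attached as real MIME parts, the last element of the returned list would usually be the innermost body. If they were pasted inline, everything ends up in one text part and this alone will not separate them.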

larsks
  • I did. Maybe I got it wrong? Sorry, I'm still a newbie. But when I call get_payload(), it simply returns the whole email as it is, headers and all. –  Nov 07 '11 at 20:13
  • There are some examples in the [module documentation](http://docs.python.org/library/email-examples.html) that show how to process the contents of a message. The module also provides some useful [iterators](http://docs.python.org/library/email.iterators.html) that accomplish similar things in a different way. – larsks Nov 07 '11 at 20:37
  • Also, [this question](http://stackoverflow.com/questions/594545/email-body-is-a-string-sometimes-and-a-list-sometimes-why) may help out. – larsks Nov 07 '11 at 20:38
  • Okay, maybe I framed the question wrong. My bad (apologies again, newbie). After I call get_payload on a multipart message, each part gets stored in a list as a message instance, so the last element in the list would be the original message. I know up until that point. That last element looks something like: some text, then a whole lot of headers for that part, then the body of the message, which is what I need to extract. How do I get rid of the rest when it doesn't follow a clear format in every mail? –  Nov 07 '11 at 20:46

Try get_payload on the parsed Message object. If the message is not multipart, it returns a string; otherwise it returns a list of Message objects.

Something like this:

messages = parsed_message.get_payload()
# Keep descending into the last part until the payload is a string, not a list of parts.
while isinstance(messages, list):
    messages = messages[-1].get_payload()
Scott A
  • Yes, it is a list of message objects. It has two elements: one with the headers and the message in plain text, the other with the message plus HTML tags and headers. Neither of them is just the plain-text message. –  Nov 07 '11 at 20:19
  • I suspect that the forwards aren't separate MIME parts, so as far as the parser is concerned it's all one message. – Scott A Nov 07 '11 at 20:36
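If the forwards really are pasted inline rather than attached as MIME parts, the parser cannot separate them, and some text-level heuristic is needed. The sketch below is one possible approach, not a general solution. It assumes the mail client inserted a separator line such as "---------- Forwarded message ----------" or "-----Original Message-----" and quoted the old headers right after it. The names FORWARD_MARKER, HEADER_LINE and innermost_forwarded_body are invented for this example, and other clients may use different separators.

```python
import re

# Assumed separator lines; real clients vary, so extend this pattern as needed.
FORWARD_MARKER = re.compile(
    r"^[> \t]*-{2,}[ \t]*(?:Forwarded message|Original Message)[ \t]*-{2,}[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# A quoted header line such as "> From: someone" or "Subject: hello".
HEADER_LINE = re.compile(r"^[> \t]*[A-Za-z-]+:\s")


def innermost_forwarded_body(text):
    """Return the text after the last forward marker, minus the quoted header block."""
    matches = list(FORWARD_MARKER.finditer(text))
    if not matches:
        return text.strip()
    tail = text[matches[-1].end():].lstrip("\n").splitlines()
    # Skip the quoted header lines (From:, Date:, Subject:, ...) after the marker.
    i = 0
    while i < len(tail) and HEADER_LINE.match(tail[i]):
        i += 1
    # Remove leading '>' quote prefixes from the remaining lines.
    body = [re.sub(r"^(?:>\s?)+", "", line) for line in tail[i:]]
    return "\n".join(body).strip()
```

This would be applied to the decoded text/plain payload, not to the raw file. Note that a body whose first line itself looks like "Word: text" would be mistaken for a header, so the result is worth checking on a few real samples.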