2

What software can I use to process raw email text and remove the signature, quoted thread text, and so on?

For example, here is an email. I would like to get just the "Thanks guys." text, or more if there were more text there. I do not want the HTML signature (in the first red block) or the old emails that the person was replying to (in the second red block).

[Image: screenshot of the example email, with the signature and the quoted thread marked in red]

Skills

1 Answer

-2

You can try Message.get_payload from Python's standard email package, which parses a raw message and returns the body of each part.

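import email

# Read the raw message (headers and body) from disk.
with open('test.txt', 'r') as myfile:
    data = myfile.read()

# Parse the text into a Message object.
msg = email.message_from_string(data)
if msg.is_multipart():
    # A multipart message carries a list of sub-messages; print each one.
    for part in msg.get_payload():
        print(part.get_payload().strip())
else:
    # A single-part message carries its body as a plain string.
    print(msg.get_payload().strip())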

It outputs:

this is the body text
this is the attachment text

The test.txt file contains the following.

From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text 
Content-Type: text/plain

this is the body text

--XXXXboundary text 
Content-Type: text/plain;
Content-Disposition: attachment;
        filename="test.txt"

this is the attachment text

--XXXXboundary text--
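
Note that get_payload returns each part in full, so a signature or quoted thread inside the body is still there. Below is a minimal sketch of one way to trim those, assuming a plain-text body. The function name extract_reply and the two patterns are illustrative assumptions, not part of the email package: the sketch stops at the conventional "-- " signature delimiter or at an "On ... wrote:" line, and drops lines starting with ">". HTML signatures like the one in the question would need separate handling.

```python
import email
import re
from email import policy

# Assumed heuristics, not a standard API: markers that commonly start
# the quoted thread or the signature in plain-text mail.
QUOTE_HEADER = re.compile(r"^On .+ wrote:$")
SIGNATURE_DELIMITER = "-- "  # conventional "dash dash space" line


def extract_reply(raw: str) -> str:
    """Return the new text of a reply, without quoted thread or signature."""
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = email.message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""  # no plain-text body found
    kept = []
    for line in part.get_content().splitlines():
        # Everything after the signature marker or the quote header is dropped.
        if line == SIGNATURE_DELIMITER or QUOTE_HEADER.match(line.strip()):
            break
        # Skip individually quoted lines that appear inline.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Calling extract_reply(data) on a reply whose body is "Thanks guys." followed by a "-- " signature and a quoted thread should return only "Thanks guys.". Mail clients format quotes in many ways, so these patterns will miss some cases.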
Wasi Ahmad