Python: How to parse the body from a raw email, given that the raw email does not have a "Body" tag or anything

Question

It seems easy to get the

From
To
Subject

etc via

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

assuming that "a" is the raw-email string which looks something like this.

a = """From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: oooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""

THE QUESTION

How do you get the body of this email in Python?

So far this is the only code I am aware of, but I have yet to test it.

if b.is_multipart():
    for part in b.get_payload():
        print part.get_payload()
else:
    print b.get_payload()

Is this the correct way?

Or maybe there is something simpler, such as:

import email
b = email.message_from_string(a)
bbb = b['body']

?

Note that Python 3.6+ has convenience get_body() methods via the upcoming default parsing policy, as noted in the newer answer by @Doctor J, and note that the answer by Todor Minakov is more robust than that by falsetru — nealmcb, Mar 12 '21 at 03:42

score 146 · Answer 1 · edited Dec 24 '22 at 00:22

To be reasonably sure you are working with the actual email body (there is still a chance you are not parsing the right part), you have to skip attachments and focus on the plain or HTML part (depending on your needs) for further processing.

Because attachments can be, and very often are, text/plain or text/html parts themselves, this non-bullet-proof sample skips them by checking the Content-Disposition header:

import email

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously on mime parts, and get_payload(decode=True) does the dirty work on decoding base64 etc. for you.
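One detail worth spelling out for Python 3: get_payload(decode=True) undoes the transfer encoding but returns bytes, so the text still has to be decoded with the part's charset. A minimal sketch, assuming a made-up base64-encoded sample message (raw is hypothetical) and a UTF-8 fallback when the part declares no charset:

```python
import email

# Hypothetical sample: a single text/plain part, base64-encoded UTF-8.
raw = (
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8gd29ybGQ=\n"
)

msg = email.message_from_string(raw)

undecoded = msg.get_payload()            # still the base64 text
payload = msg.get_payload(decode=True)   # transfer encoding undone, bytes

# Fall back to UTF-8 if no charset is declared (an assumption, not a rule).
charset = msg.get_content_charset() or "utf-8"
text = payload.decode(charset, errors="replace")
```

Using errors="replace" keeps a mislabeled charset from raising, at the cost of possibly substituting a few characters.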

Some background: as I implied, the wonderful world of MIME emails presents a lot of pitfalls for "wrongly" finding the message body. In the simplest case it is in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world: the body is often wrapped in multipart/alternative, related, mixed etc. content. Wikipedia's MIME article describes it concisely, but considering that all the cases below are valid, and common, one has to put safety nets all around:

Very common: pretty much what you get from a normal editor (Gmail, Outlook) when sending formatted text with an attachment:

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel

Relatively simple - just alternative representation:

multipart/alternative
 |
 +- text/plain
 +- text/html

For good or bad, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg
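When debugging which layout a given message actually has, it can help to print its tree. The following is only a sketch: structure() is a hypothetical helper (not part of the email package), and the message is built by hand to mirror the last layout above.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def structure(msg, depth=0):
    """Return one indented content-type line per MIME part, outermost first."""
    lines = ["  " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for part in msg.get_payload():
            lines.extend(structure(part, depth + 1))
    return lines


# Hypothetical message: multipart/alternative holding text/plain and a
# multipart/related container (text/html plus an inline image).
related = MIMEMultipart("related")
related.attach(MIMEText("<p>hello</p>", "html"))
related.attach(MIMEImage(b"not a real image", _subtype="jpeg"))

outer = MIMEMultipart("alternative")
outer.attach(MIMEText("hello", "plain"))
outer.attach(related)

tree = structure(outer)
print("\n".join(tree))
```

The same helper works on a parsed message, e.g. structure(email.message_from_string(a)).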

P.S. My point is don't approach email lightly - it bites when you least expect it :)

Thanks for this thorough example and for spelling out a warning - in contrary to the accepted answer. I think this is a far better/safer approach. — Simon Steinberger, Jun 23 '17 at 15:04
Ah, very good! `.get_payload(decode=True)` instead of just `.get_payload()` has made life much easier, thanks! — Mark, Jul 30 '19 at 03:55
I am looking for only the body from .get_payload(decode=True). Is there any way for it ? — abhijitcaps, Apr 04 '21 at 00:33

score 98 · Accepted Answer · edited Mar 04 '14 at 13:34

Use Message.get_payload

import email

b = email.message_from_string(a)
if b.is_multipart():
    for payload in b.get_payload():
        # if payload.is_multipart(): ...
        print payload.get_payload()
else:
    print b.get_payload()
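The commented-out is_multipart() branch hints that parts can themselves be multipart. One possible way to handle that is recursion; this is a sketch only, with iter_leaf_payloads as a hypothetical helper name and a made-up nested sample message:

```python
import email


def iter_leaf_payloads(msg):
    """Yield the payload of every non-multipart part, however deeply nested."""
    if msg.is_multipart():
        for part in msg.get_payload():
            # A part may itself be a multipart container, so recurse.
            for payload in iter_leaf_payloads(part):
                yield payload
    else:
        yield msg.get_payload()


# Hypothetical sample: multipart/mixed wrapping a multipart/alternative.
raw = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "--outer",
    'Content-Type: multipart/alternative; boundary="inner"',
    "",
    "--inner",
    "Content-Type: text/plain",
    "",
    "plain body",
    "--inner",
    "Content-Type: text/html",
    "",
    "<p>html body</p>",
    "--inner--",
    "",
    "--outer--",
    "",
])

payloads = [p.strip() for p in iter_leaf_payloads(email.message_from_string(raw))]
```

This yields every leaf, attachments included, so it still needs a content-type or Content-Disposition filter before the result can be treated as "the body".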

answered Jul 26 '13 at 06:30 by falsetru · edited Mar 04 '14 at 13:34 by Gagandeep Singh

Other answers do a better job of being more robust and leveraging the newer get_body() functionality. – nealmcb Mar 12 '21 at 03:45
@nealmcb, When I answered there was no `get_body` ;) Seems it appeared since Python 3.6. BTW, this question is tagged `python-2.7` where you can't use `get_body` – falsetru Mar 12 '21 at 07:05
Good point! Of course with Python 2 now over a year past end-of-life, we can assume much more interest in modern solutions. But also note that as Todor describes, many emails have tricky structures, so a more general approach is a good idea, and your "..." is not very specific. – nealmcb Mar 13 '21 at 15:55

score 19 · Answer 3 · answered Mar 15 '18 at 09:05

There is a very good package, mail-parser, for parsing email contents, and it comes with proper documentation.

import mailparser

mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)

How to Use:

mail.attachments: list of all attachments
mail.body
mail.to
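A commenter below reports that mail.body joins the body parts with the string "\n--- mail_boundary ---\n". If that holds for your installed version (treat it as an assumption and verify it), the parts can be recovered with a plain string split. The sketch below does not import mailparser; sample is a hypothetical stand-in for mail.body and split_body is a made-up helper name.

```python
# Separator as reported in the comments; an assumption, not a documented API.
SEPARATOR = "\n--- mail_boundary ---\n"


def split_body(joined_body):
    """Split a joined body string back into its non-empty parts."""
    return [part for part in joined_body.split(SEPARATOR) if part.strip()]


# Hypothetical stand-in for mail.body
sample = "plain text version" + SEPARATOR + "<p>html version</p>"
parts = split_body(sample)
```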

answered by Amit Sharma

Library is great, but I had to make my own class that inherits from `MailParser` and override **body** method, because it joins the parts of email's body with **"\n--- mail_boundary ---\n"** which was not ideal for me. – sup Sep 21 '18 at 12:30
hi @avram, could you please share the class that you have written ? – Amey P Naik May 13 '19 at 12:53
I managed to split the result on "\n--- mail_boundary ---\n". – Amey P Naik May 14 '19 at 07:16
@AmeyPNaik Here I made a quick github gist: https://gist.github.com/aleksaa01/ccd371869f3a3c7b3e47822d5d78ccdf – sup May 14 '19 at 20:11
@AmeyPNaik in their [documentation](https://pypi.org/project/mail-parser/), it says: _mail-parser can parse Outlook email format (.msg). To use this feature, you need to install libemail-outlook-message-perl package_ – Ciprian Tomoiagă Dec 03 '19 at 10:57

score 15 · Answer 4 · answered May 10 '20 at 06:53

Python 3.6+ provides built-in convenience methods to find and decode the plain text body, as in @Todor Minakov's answer. You can use the EmailMessage.get_body() and get_content() methods:

import email
import email.policy

# s is the raw email string
msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
print(body)

Note this will give None if there is no (obvious) plain text body part.
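If an HTML fallback is acceptable, the preference list passed to get_body() can name more than one subtype. A minimal sketch, assuming a made-up HTML-only message (raw is hypothetical):

```python
import email
from email import policy

# Hypothetical HTML-only message, made up for illustration.
raw = (
    "Subject: html only\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<p>hello</p>\n"
)
msg = email.message_from_string(raw, policy=policy.default)

plain_only = msg.get_body(preferencelist=("plain",))    # no text/plain part here
best = msg.get_body(preferencelist=("plain", "html"))   # falls back to the HTML part

content = best.get_content() if best is not None else None
subtype = best.get_content_subtype() if best is not None else None
```

Checking subtype afterwards tells the caller whether the returned text still needs HTML stripping.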

If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage factory:

import email.policy
import mailbox

# mboxfile is the path to the mbox file
mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
    ...

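Note that you must pass email.policy.default as the policy explicitly: despite its name it is not what the parser functions use by default (they use the backwards-compatible compat32 policy).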
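The practical effect of the policy argument can be seen by parsing the same text twice. This is a small sketch on Python 3.6+, with a made-up minimal message; without the policy the parser returns a legacy Message object, which has no get_body():

```python
import email
from email import policy
from email.message import EmailMessage

raw = "Subject: demo\n\nhello\n"  # hypothetical minimal message

legacy = email.message_from_string(raw)                         # compat32 policy
modern = email.message_from_string(raw, policy=policy.default)  # modern API

legacy_has_get_body = hasattr(legacy, "get_body")
modern_is_email_message = isinstance(modern, EmailMessage)
body_text = modern.get_body(preferencelist=("plain",)).get_content()
```

An AttributeError mentioning get_body usually means the policy was left out (or the code is running on Python 2).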

answered by Doctor J

Why isn't `email.policy.default` the default? Seems like it should be. – PartialOrder Jul 28 '20 at 18:29
@PartialOrder Backwards compatibility. It *will* be the default, and you should already use it now. – Bergi Mar 08 '21 at 10:54
This is very informative and encouraging, but had me confused for a while. The `lambda` doesn't reveal lack of import of "email.policy" right away, and I guess the factory is not consulted if you access a message explicitly, e.g. via `mbox.get_message(0)` Folks can note also the more explicit `make_EmailMessage` factory function approach at https://stackoverflow.com/a/57550079/507544 – nealmcb Mar 12 '21 at 02:37
```$ python -c 'import email, sys; msg = email.message_from_string(sys.stdin.read()); print(msg.get_body())' <<< some_text Traceback (most recent call last): File "", line 1, in AttributeError: Message instance has no attribute 'get_body'``` I got this error. Would you please let me know what is wrong? – user1424739 Mar 19 '23 at 02:42

score 4 · Answer 5 · answered Jul 26 '13 at 06:36

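There is no b['body'] in Python. You have to use get_payload.

import email

mailEntity = email.message_from_string(a)  # "a" is the raw email string from the question
if isinstance(mailEntity.get_payload(), list):
    for eachPayload in mailEntity.get_payload():
        # do the things you want here;
        # the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # there is only a single text/plain part;
    # use mailEntity.get_payload() to get the body
    print(mailEntity.get_payload())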

Good Luck.

score 1 · Answer 6 · edited Dec 04 '18 at 15:18

If emails is a pandas DataFrame and emails.message is the column holding the raw email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs 

import email
# Parse the emails into a list of email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages

emails.head()
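One caveat with split_email_addresses: splitting on ',' breaks when a display name contains a comma. A possible alternative, sketched here with the standard library's email.utils.getaddresses (split_email_addresses_safe is a hypothetical name and the header value is made up):

```python
from email.utils import getaddresses


def split_email_addresses_safe(line):
    """Return the set of bare addresses in a header value, or None if empty."""
    if not line:
        return None
    # getaddresses() understands quoted display names, so the comma inside
    # "Smith, Bob" is not mistaken for an address separator.
    return frozenset(addr for _name, addr in getaddresses([line]) if addr)


header = '"Smith, Bob" <bob@example.com>, alice@example.com'  # hypothetical
addresses = split_email_addresses_safe(header)
```

Unlike the original helper, this returns bare addresses without display names, which may or may not be what the DataFrame column should hold.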

score 0 · Answer 7 · answered Nov 07 '22 at 19:37

A small update based on Doctor J's answer. It parses the plaintext portion (if any) of the email message. You may want to try getting the HTML as well, since the (bad) habit of sending HTML-only mails is increasingly popular.

from email import message_from_string
from email import policy

raw_string = raw_string.strip() # where raw_string is the email message (DATA)
msg = message_from_string(raw_string, policy=policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
    print(body)

When working with email DATA as strings, it's necessary to strip leading/trailing whitespace. I wasted a lot of time without it!
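To illustrate why stripping matters, here is a small sketch with a made-up message that starts with a stray newline. The parser treats that first blank line as the end of the header block, so no headers are found until the string is stripped:

```python
from email import message_from_string, policy

# Hypothetical message with a stray leading newline.
raw = "\nSubject: demo\n\nhello\n"

unstripped = message_from_string(raw, policy=policy.default)
stripped = message_from_string(raw.strip(), policy=policy.default)

subject_unstripped = unstripped["Subject"]  # header block ended at the blank first line
subject_stripped = stripped["Subject"]
body_stripped = stripped.get_body(preferencelist=("plain",)).get_content()
```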

Deepesh Verma · Answer 8 · 2019-01-30T08:57:29.487

Here's the code that works for me every time (for Outlook emails):

# Read the subject and body of unread emails in an Outlook folder (or subfolder)
import win32com.client

# Connect to Outlook through the MAPI namespace
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Get to the desired folder ('MyEmail@xyz.com' is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['MyEmail@xyz.com'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:
    if message.Unread:  # gets only 'Unread' emails
        subject_content = message.subject  # subject line of the mail
        body_content = message.body        # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False  # mark the mail as 'Read'

Perhaps spell out that this is for Outlook on Windows, not for real email. — tripleee, Jan 30 '19 at 08:20
