I'm extracting the links from the email using imaplib and email, but the result is missing the main link, although the others are there.

#Assume that I know the id of an email that I need to parse '599'
typ, email_data = mail.fetch('599', '(RFC822)')

msg = email.message_from_bytes(email_data[0][1])
print(msg.get_payload()[0].get_payload())

Here's my email with three links:

(screenshot of the email in Gmail)

This is the result:

Today's highlights

Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions

This week, I was tutoring a student client of mine. We have been working ou= r way through using Auth0. It=E2=80=A6

Jay (https://medium.com/@second-link) in ProjectWT (https://medium.com/@third-link) =C2=B73 min read

Links two and three are absolutely identical to those in the email, but as you can see the first link is missing (also in all similar cases) and I can't understand why. Any help would be appreciated.

Adding the default policy is not helping.

message = email.message_from_bytes(msg_as_bytes, policy=policy.default)
  • You are not decoding your message, as it still has Quoted-Printable encoding applied. Try adding 'policy=email.policy.default' to your message_from_bytes call, which should switch to the newer parser, which should do this automatically for you. – Max Jul 03 '22 at 15:31
  • I have tried adding a policy like this: `message = email.message_from_bytes(msg_as_bytes, policy=policy.default)`. The result is the same. – Aleh Belausau Jul 03 '22 at 15:49
  • Can you please [edit] to remove the IMAP parts and reduce this to a [mre] with a simple small email message which exhibits this problem? – tripleee Jul 04 '22 at 20:16

1 Answer

The immediate problem is probably that you are extracting links from a MIME part which contains only two of the three links. The structure of the message is apparently something like

-+ multipart/alternative
 -- text/plain
 -+ multipart/related
  -- text/html
  -- image/png
  -- image/png

where your screen shot shows the text/html part with its related images, but the text excerpt shows the first text/plain part, and the link extraction targets that, too.
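To check which structure a given message actually has, you can print its MIME tree. The following is a minimal sketch; the RAW message is a made-up stand-in with the structure assumed above (one image instead of two), and describe_structure is a hypothetical helper name, not part of the standard library.

```python
import email
from email.policy import default

# Hypothetical stand-in message with the assumed structure
RAW = b"""\
From: me@example.net
To: you@example.org
Subject: Structure demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="outer"

--outer
Content-Type: text/plain; charset=utf-8

Plain text rendering
--outer
Content-Type: multipart/related; boundary="inner"

--inner
Content-Type: text/html; charset=utf-8

<p><a href="https://example.com/spam">HTML rendering</a></p>
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-ID: <foo@example.net>

iVBORw0KGgo=
--inner--
--outer--
"""


def describe_structure(msg):
    """Return one (depth, content_type) pair per MIME part, in document order."""
    rows = []

    def visit(part, depth):
        rows.append((depth, part.get_content_type()))
        if part.is_multipart():
            # iter_parts() yields only the immediate children of a multipart
            for child in part.iter_parts():
                visit(child, depth + 1)

    visit(msg, 0)
    return rows


msg = email.message_from_bytes(RAW, policy=default)
structure = describe_structure(msg)
for depth, ctype in structure:
    print("  " * depth + ctype)
```

In the question's code, msg.get_payload()[0] corresponds to the first child of the top-level container, which in this layout is the text/plain part.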

In the general case, if you are processing a collection of messages from multiple senders using multiple email clients and sending multiple types of messages (some with embedded images, others perhaps a PDF attachment or a collection of CSV files), you will need to perform an analysis of each individual message's structure and decide which MIME part(s) you want to extract based on those results. But for the common case where the message's top-level structure is either just a single body part or a common multipart/alternative with a text/plain and a text/html rendering of the same "main" message (in any order), recent versions of Python offer a simple method which attempts to "do the right thing".
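As a rough illustration of that kind of per-message analysis, here is a sketch which separates candidate body parts from attachments. The message is built in code purely so the example is self-contained; summarize, the file names, and the addresses are all hypothetical, and real messages may need more careful rules than this.

```python
from email.message import EmailMessage


def summarize(msg):
    """Return (body content types, attachment filenames) for one message."""
    bodies = []
    attachments = []
    for part in msg.walk():
        if part.is_attachment():
            # Anything with "Content-Disposition: attachment"
            attachments.append(part.get_filename())
        elif part.get_content_maintype() == "text":
            # Inline text parts are candidates for the main body
            bodies.append(part.get_content_type())
    return bodies, attachments


# Build a hypothetical message: plain + HTML renderings plus two attachments
msg = EmailMessage()
msg["From"] = "me@example.net"
msg["To"] = "you@example.org"
msg["Subject"] = "Report"
msg.set_content("See the attached files.")
msg.add_alternative("<p>See the attached files.</p>", subtype="html")
msg.add_attachment("a,b\n1,2\n", subtype="csv", filename="data.csv")
msg.add_attachment(b"%PDF-1.4\n", maintype="application", subtype="pdf",
                   filename="report.pdf")

bodies, attachments = summarize(msg)
print(bodies)
print(attachments)
```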

As an aside, the email module in the standard library was overhauled in Python 3.6 to be more logical, versatile, and succinct; new code should target the (no longer very) new EmailMessage API. When you supply a policy argument such as policy=email.policy.default to message_from_bytes, this is what you get. Without it, you get the legacy email.message.Message API, also called "compat32" because it's compatible back to Python 3.2 and earlier. (The new API was informally introduced in Python 3.3, though it only became the preferred and official API in 3.6.)
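A small sketch to see the difference for yourself, assuming Python 3.6 or later; the RAW message is a made-up one-liner.

```python
import email
from email import policy
from email.message import EmailMessage, Message

RAW = b"Subject: demo\n\nbody\n"

# Without a policy you get the legacy (compat32) Message class
legacy = email.message_from_bytes(RAW)
# With policy.default you get the modern EmailMessage class
modern = email.message_from_bytes(RAW, policy=policy.default)

print(type(legacy).__name__)
print(type(modern).__name__)
# get_body() only exists on the modern class
print(hasattr(legacy, "get_body"), hasattr(modern, "get_body"))
```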

With that, the following code should hopefully do what you want.

from email.policy import default  # in addition to "import email" from the question

msg = email.message_from_bytes(email_data[0][1], policy=default)
print(msg.get_body())

With the new API, calling get_content() on the extracted body part decodes its content transfer encoding and character set for you, so you should not need to request decoding separately. The missing decoding was another problem with your original attempt.
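To illustrate the decoding issue, here is a sketch comparing the two APIs on a single quoted-printable part. The RAW message is a made-up fragment which reuses one line from the question's output.

```python
import email
from email import policy

RAW = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Web API in=C2=A0.Net 6.0\n"
)

# Legacy API: get_payload() returns the text exactly as transmitted
legacy = email.message_from_bytes(RAW)
raw_text = legacy.get_payload()
# Legacy API: decode=True undoes quoted-printable but returns bytes,
# so the charset still has to be applied by hand
decoded_legacy = legacy.get_payload(decode=True).decode("utf-8")

# Modern API: get_content() handles both steps for text parts
modern = email.message_from_bytes(RAW, policy=policy.default)
decoded_modern = modern.get_content()

print(repr(raw_text))
print(repr(decoded_legacy))
print(repr(decoded_modern))
```

The =C2=A0 sequence is the quoted-printable form of a UTF-8 non-breaking space, which is why it shows up verbatim in the question's output.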

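get_body() (which did not exist at all in the legacy API) lets you specify an ordered list of preferred MIME types through its preferencelist argument. The default is ('related', 'html', 'plain'): it prefers a multipart/related container if there is one, then HTML, and otherwise falls back to plain text. If you want the text/html part itself rather than the container with its images, pass preferencelist=('html', 'plain').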
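Putting this together, here is a sketch which selects the HTML part and collects its links. The RAW message is a cut-down, made-up message with the assumed structure, and LinkCollector is a hypothetical helper built on the standard library's html.parser; a dedicated HTML library may cope better with real-world markup.

```python
import email
from email.policy import default
from html.parser import HTMLParser

# Hypothetical message: plain text without the link, HTML with it
RAW = b"""\
From: me@example.net
To: you@example.org
Subject: Link demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="outer"

--outer
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Web API in=C2=A0.Net 6.0
--outer
Content-Type: multipart/related; boundary="inner"

--inner
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<p><a href=3D"https://example.com/spam">=
Web API in=C2=A0.Net 6.0</a></p>
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-ID: <foo@example.net>

iVBORw0KGgo=
--inner--
--outer--
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs
                              if name == "href" and value)


msg = email.message_from_bytes(RAW, policy=default)

# With the default preference list, the multipart/related container wins
default_choice = msg.get_body()
# Asking for html first returns the text/html part inside that container
html_part = msg.get_body(preferencelist=("html", "plain"))

collector = LinkCollector()
collector.feed(html_part.get_content())  # get_content() decodes the part
links = collector.links
print(default_choice.get_content_type())
print(html_part.get_content_type())
print(links)
```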

For testing, here is a quick and dirty example message with the assumed structure. If you need more help, probably post a new question with a sample message (ideally pared down to just the essentials, and probably without the IMAP code which isn't relevant for this particular problem).

From: tripleee <me@example.net>
To: you <aleh.b@example.org>
Subject: Simple multipart example
MIME-Version: 1.0
Content-type: multipart/alternative; boundary="snowden-risen-woodward-manning"

--snowden-risen-woodward-manning
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: quoted-printable

Today's highlights

Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=

--snowden-risen-woodward-manning
Content-type: multipart/related; boundary="pol-pot-stalin-trump-mao"

--pol-pot-stalin-trump-mao
Content-type: text/html; charset=utf-8
Content-transfer-encoding: quoted-printable

<h1>Today's highlights</h1>

<p><a href=3D"https://example.com/spam">=
Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=
</a></p>
<img src=3D"cid:foo@example.net"/>
<img src=3D"cid:bar@example.net"/>

--pol-pot-stalin-trump-mao
Content-type: image/png
Content-transfer-encoding: base64
Content-id: <foo@example.net>

somebase64gobbledygook=
--pol-pot-stalin-trump-mao
Content-type: image/png
Content-transfer-encoding: base64
Content-id: <bar@example.net>

morebase64gobbledygook=
--pol-pot-stalin-trump-mao--
--snowden-risen-woodward-manning--
tripleee
  • The insidious idea to use "spook" strings in MIME boundaries is due to [Noah Friedman.](http://www.splode.com/~friedman/software/emacs-lisp/) – tripleee Jul 06 '22 at 14:38
  • Perhaps see also https://stackoverflow.com/questions/48562935/what-are-the-parts-in-a-multipart-email for an overview of MIME parts. – tripleee Jul 06 '22 at 16:29