Get email address from string in Python with Regex

Question

** NOTE: I have already researched this question heavily on Stack Overflow and have not found a solution! I am unable to apply the other answers to my problem, so I need some help. **

The challenge: I want to get an email address from a string but am having trouble targeting the email address only with Regex.

The email address I want from the HTML is:

query-e1h1@email.net

The HTML is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<html>\r\n<head></head>\r\n<body>\r\n<a name="top"></a>Back to Category Index</a></p>\r\n<p>-----------------------------------<br/></p>\r\n\r\n67)<a name="e1h1" id="e1h1"></a> Summary: Solar Eclipse 2024 Travel\r\n<br/><br/>\r\n<p>Name: laure gem wilson\r\nRoadtrippers\r\n</p>Category: Travel\r\n<br/><br/>\r\nEmail: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>\r\nOutlet: Roadtrip<br/><br/>\r\nDeadline: 7:00 PM EST - 8 July\r\n<br/><br/>\r\n<p>\r\nQuery: \r\n<br/><br/>\r\nHi, I am on assignment to write a feature about planning a road<br/>trip to experience the Solar Eclipse 2024, including path of<br/>totality, advice about viewing, and recommendations for when and<br/>where to book accommodations, thanks!<br/>\r\n</p>\r\n<p>\r\nRequirements: \r\n<br /><br />\r\nMust be domestic USA<br/>\r\n</p>\r\n<p><a href="#top">Back to Top</a> <a href="#Travel">Back to Category Index</a></p>\r\n<p>-----------------------------------<br/>

My Python code is:

Query_Email = re.findall(r'Email:.+', msg_content[index_counter])

This returns:

<a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>
Authority Magazine<br/><br/>

What was wrong with using bs4 and `soup.select_one('[href^=mailto]').text` ? — QHarr, Jul 15 '22 at 03:16
Vandalizing your own question (and user name for that matter) is not acceptable behavior. — ShadowRanger, Jul 27 '22 at 23:27

score 0 · Answer 1 · answered Jul 14 '22 at 06:16

You can capture just the email from the mailto: part with a lazy match that stops at the first "> after it:

mailto:(.*?)">

https://regex101.com/r/Xk4Ywk/1

This should capture the email within the group.
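
As a rough sketch of how this pattern might be used from Python: the variable name `html_snippet` and the shortened HTML are assumptions made for illustration, standing in for the full message from the question.

```python
import re

# Shortened, hypothetical stand-in for the HTML in the question.
html_snippet = (
    'Email: <a href="mailto:query-e1h1@email.net">'
    'query-e1h1@email.net</a>\r\n<br/><br/>'
)

# (.*?) is lazy: it captures as little as possible up to the first '">'.
match = re.search(r'mailto:(.*?)">', html_snippet)

# re.search returns None when nothing matches, so guard before reading the group.
email = match.group(1) if match else None
print(email)  # query-e1h1@email.net
```

Using `re.search` rather than `re.findall` returns only the first hit, which is enough when each message holds a single address.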

M B


score 0 · Answer 2 · answered Jul 14 '22 at 06:40

If you just want to extract an email address from arbitrary text: email matching is one of the most popular uses of regex, and such a pattern is easy to find. Just google 'email regex' and you'll get your answer. I used the first search result and modified it slightly (I put \b, word boundaries, in place of ^ and $, the text boundaries):

\b[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\b

Here's regex demo.
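
A minimal sketch of running this kind of general pattern with `re.findall`, written with no spaces inside the character class. The sample string and the names `EMAIL_RE`, `matches` and `unique` are assumptions for illustration only. Because the address appears both in the href and in the link text, the sketch de-duplicates the hits.

```python
import re

# General-purpose email pattern with \b word boundaries instead of ^ and $.
EMAIL_RE = re.compile(
    r"\b[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\b"
)

# Shortened, hypothetical stand-in for the HTML in the question.
text = 'Email: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>'

# findall returns every match: once from the href, once from the link text.
matches = EMAIL_RE.findall(text)

# dict.fromkeys removes duplicates while keeping first-seen order.
unique = list(dict.fromkeys(matches))
print(unique)  # ['query-e1h1@email.net']
```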

BUT

if you're trying to extract information from HTML, DO NOT USE REGEX, because :)

@user3424298 please take a look at [html library for python](https://docs.python.org/3/library/html.html) — Michał Turczyn, Jul 14 '22 at 18:18

The fourth bird · Answer 3 · 2022-07-14T07:47:07.503

You could use your Email: prefix and use a capture group:

\bEmail:\s*<a\s[^<>]*\bhref="mailto:([^"]+)"

Explanation

\bEmail:\s* Match Email: preceded by a word boundary and followed by optional whitespace chars
<a\s Match <a followed by a whitespace char
[^<>]* Optionally match any chars other than < and >
\bhref="mailto: Match href="mailto: literally, preceded by a word boundary
([^"]+)" Capture the value up to the closing double quote in group 1

Regex demo

import re

pattern = r"\bEmail:\s*<a\s[^<>]*\bhref=\"mailto:([^\"]+)\""
s = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<html>\r\n<head></head>\r\n<body>\r\n<a name="top"></a>Back to Category Index</a></p>\r\n<p>-----------------------------------<br/></p>\r\n\r\n67)<a name="e1h1" id="e1h1"></a> Summary: Solar Eclipse 2024 Travel\r\n<br/><br/>\r\n<p>Name: laure gem wilson\r\nRoadtrippers\r\n</p>Category: Travel\r\n<br/><br/>\r\nEmail: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>\r\nOutlet: Roadtrip<br/><br/>\r\nDeadline: 7:00 PM EST - 8 July\r\n<br/><br/>\r\n<p>\r\nQuery: \r\n<br/><br/>\r\nHi, I am on assignment to write a feature about planning a road<br/>trip to experience the Solar Eclipse 2024, including path of<br/>totality, advice about viewing, and recommendations for when and<br/>where to book accommodations, thanks!<br/>\r\n</p>\r\n<p>\r\nRequirements: \r\n<br /><br />\r\nMust be domestic USA<br/>\r\n</p>\r\n<p><a href="#top">Back to Top</a> <a href="#Travel">Back to Category Index</a></p>\r\n<p>-----------------------------------<br/>"""

print(re.findall(pattern, s))

Output

['query-e1h1@email.net']

Note that if you have a DOM parser available, that would be a better option.
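
For comparison, here is a sketch of a parser-based approach using only the standard library's `html.parser`. This is an event-based parser rather than a full DOM, but it avoids matching markup with regex. The class name `MailtoCollector` and the shortened HTML are hypothetical, introduced for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Hypothetical helper: collect addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = value[len("mailto:"):].split("?", 1)[0]
                self.emails.append(address)


# Shortened, hypothetical stand-in for the HTML in the question.
html_snippet = (
    '<p>Name: laure gem wilson</p>Category: Travel<br/><br/>\r\n'
    'Email: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n'
    '<br/><br/>\r\nOutlet: Roadtrip<br/><br/>'
)

collector = MailtoCollector()
collector.feed(html_snippet)
collector.close()
print(collector.emails)  # ['query-e1h1@email.net']
```

Reading the href attribute means the address is picked up once per link, regardless of what the link text says.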
