
** NOTE: I have already researched this question heavily on Stack Overflow and have not found a solution! I am unable to apply the other answers to my problem, so I need some help. **

The challenge: I want to extract an email address from a string, but I am having trouble targeting only the email address with a regex.

The email address I want from the HTML is:

query-e1h1@email.net

The HTML is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<html>\r\n<head></head>\r\n<body>\r\n<a name="top"></a>Back to Category Index</a></p>\r\n<p>-----------------------------------<br/></p>\r\n\r\n67)<a name="e1h1" id="e1h1"></a> Summary: Solar Eclipse 2024 Travel\r\n<br/><br/>\r\n<p>Name: laure gem wilson\r\nRoadtrippers\r\n</p>Category: Travel\r\n<br/><br/>\r\nEmail: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>\r\nOutlet: Roadtrip<br/><br/>\r\nDeadline: 7:00 PM EST - 8 July\r\n<br/><br/>\r\n<p>\r\nQuery: \r\n<br/><br/>\r\nHi, I am on assignment to write a feature about planning a road<br/>trip to experience the Solar Eclipse 2024, including path of<br/>totality, advice about viewing, and recommendations for when and<br/>where to book accommodations, thanks!<br/>\r\n</p>\r\n<p>\r\nRequirements: \r\n<br /><br />\r\nMust be domestic USA<br/>\r\n</p>\r\n<p><a href="#top">Back to Top</a> <a href="#Travel">Back to Category Index</a></p>\r\n<p>-----------------------------------<br/>

My Python code is:

Query_Email = re.findall(r'Email:.+', msg_content[index_counter])

This returns:

<a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>
Authority Magazine<br/><br/>
jmoerdyk

3 Answers


You can capture the email inside the mailto: part with a lazy match that stops at the first ">:

mailto:(.*?)">

https://regex101.com/r/Xk4Ywk/1

This should capture the email address in group 1.
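As a minimal sketch of how that pattern behaves in Python, the snippet below runs it with re.findall against a shortened copy of the question's HTML. The variable names html and emails are assumptions made for illustration, not names from the original code.

```python
import re

# Shortened sample of the HTML from the question (assumed for illustration).
html = 'Email: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>'

# The lazy group (.*?) stops at the first '">' that follows 'mailto:'.
# With a single capture group, findall returns the group contents only.
emails = re.findall(r'mailto:(.*?)">', html)
print(emails)  # ['query-e1h1@email.net']
```

Because the pattern keys on the closing ">, it assumes that href is the last attribute in the tag and that it is double-quoted.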

M B

If you just want to extract an email address from arbitrary text, an email regex is one of the most commonly used patterns and is easy to find: search for 'email regex' and you will get your answer. I took the first search result and modified it slightly (I replaced the text anchors ^ and $ with \b word boundaries):

\b[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\b

Here's a regex demo.
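A rough sketch of applying this generic pattern in Python, again on a shortened copy of the question's HTML. The names EMAIL_RE, matches, and unique are assumptions for illustration. The address appears twice in the markup (once in the href and once as the link text), so the sketch also removes duplicates while keeping order.

```python
import re

# Generic email pattern with \b word boundaries in place of ^ and $.
EMAIL_RE = re.compile(
    r"\b[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\b"
)

# Shortened sample of the HTML from the question (assumed for illustration).
html = 'Email: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>'

# No capture groups in the pattern, so findall returns the full matches.
matches = EMAIL_RE.findall(html)

# dict.fromkeys drops duplicates while preserving first-seen order.
unique = list(dict.fromkeys(matches))
print(unique)  # ['query-e1h1@email.net']
```

This approach returns every address-like string in the text, so it does not distinguish the Email: field from any other address that might appear in the message.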

BUT

if you're trying to extract information from HTML, DO NOT USE REGEX, because :)

Michał Turczyn

You could use your Email: prefix and use a capture group:

\bEmail:\s*<a\s[^<>]*\bhref="mailto:([^"]+)"

Explanation

  • \bEmail:\s* A word boundary, then match Email: followed by optional whitespace chars
  • <a\s Match <a followed by a whitespace char
  • [^<>]* Optionally match any chars other than < and >
  • \bhref="mailto: Match href="mailto: literally, preceded by a word boundary
  • ([^"]+)" Capture the value up to the closing double quote in group 1

Regex demo

import re

pattern = r"\bEmail:\s*<a\s[^<>]*\bhref=\"mailto:([^\"]+)\""
s = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<html>\r\n<head></head>\r\n<body>\r\n<a name="top"></a>Back to Category Index</a></p>\r\n<p>-----------------------------------<br/></p>\r\n\r\n67)<a name="e1h1" id="e1h1"></a> Summary: Solar Eclipse 2024 Travel\r\n<br/><br/>\r\n<p>Name: laure gem wilson\r\nRoadtrippers\r\n</p>Category: Travel\r\n<br/><br/>\r\nEmail: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>\r\nOutlet: Roadtrip<br/><br/>\r\nDeadline: 7:00 PM EST - 8 July\r\n<br/><br/>\r\n<p>\r\nQuery: \r\n<br/><br/>\r\nHi, I am on assignment to write a feature about planning a road<br/>trip to experience the Solar Eclipse 2024, including path of<br/>totality, advice about viewing, and recommendations for when and<br/>where to book accommodations, thanks!<br/>\r\n</p>\r\n<p>\r\nRequirements: \r\n<br /><br />\r\nMust be domestic USA<br/>\r\n</p>\r\n<p><a href="#top">Back to Top</a> <a href="#Travel">Back to Category Index</a></p>\r\n<p>-----------------------------------<br/>"""

print(re.findall(pattern, s))

Output

['query-e1h1@email.net']

Note that if you have a dom parser, that would be a better option.
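For comparison, here is a minimal sketch of the parser-based route using only the standard library's html.parser. Note that this is an event-driven parser rather than a full DOM, and that the class name MailtoCollector and the shortened sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Hypothetical helper: collects addresses from <a href="mailto:..."> tags."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                self.emails.append(value[len("mailto:"):].split("?", 1)[0])


# Shortened sample of the HTML from the question (assumed for illustration).
html = 'Email: <a href="mailto:query-e1h1@email.net">query-e1h1@email.net</a>\r\n<br/><br/>'

collector = MailtoCollector()
collector.feed(html)
print(collector.emails)  # ['query-e1h1@email.net']
```

Unlike the regex, this does not depend on attribute order or spacing inside the tag, but it collects every mailto: link on the page rather than only the one after the Email: label.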

The fourth bird