Regexing an email from HTML

Question

I am trying to extract email addresses from a set of HTML source files with a regex. Each address is stored as an attribute of an <a> tag and looks like this: data-email="example@email.com"

I'm quite new to regex and came up with this: /\w+\s*=\s*".*?"/ but it doesn't seem to work. Getting my head around it all is difficult.

What could I do?

Appreciate any help.

Possible duplicate of http://stackoverflow.com/questions/28888194/extract-emails-from-html-using-regex?rq=1 — Ashish, Apr 15 '16 at 08:21
[Stop Validating Email Addresses With Regex](https://davidcel.is/posts/stop-validating-email-addresses-with-regex/) — Thomas Ayoub, Apr 15 '16 at 08:22

score 2 · Accepted Answer · answered Apr 15 '16 at 08:27

If your source code is HTML, wouldn't it be easier to use an HTML parser? You could use lxml, for example:

from lxml import etree

html = etree.HTML("""
<html>
    <head>
        <title>History of Roundish Stones in the Paleozoic Era</title>
    </head>
    <body>
        <a href="#" data-email="example@email.com">Andrew S. Johnson</a>
        <a href="#" data-email="other-example@email.com">E. Idle</a>
    </body>
</html>
""")

print(html.xpath('//@data-email'))

This prints:

['example@email.com', 'other-example@email.com']
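If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the class name DataEmailCollector, the helper extract_emails, and the sample markup are all made up for the example, and it only looks at <a> tags.

```python
from html.parser import HTMLParser


class DataEmailCollector(HTMLParser):
    """Collect the data-email attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "data-email" and value:
                    self.emails.append(value)


def extract_emails(html_text):
    """Return all data-email values found on <a> tags, in document order."""
    parser = DataEmailCollector()
    parser.feed(html_text)
    parser.close()
    return parser.emails


# Assumed sample markup, modelled on the snippet in the answer
sample = """
<a href="#" data-email="example@email.com">Andrew S. Johnson</a>
<a href="#" data-email="other-example@email.com">E. Idle</a>
<a href="#">No address here</a>
"""
print(extract_emails(sample))
# ['example@email.com', 'other-example@email.com']
```

Links without a data-email attribute are simply skipped, so no exception handling is needed.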

score 0 · Answer 2 · answered Apr 15 '16 at 08:22

If I understand your question correctly, this should extract the email addresses:

>>> import re
>>> print(re.findall(r'(?<=data-email=")[^"]*(?=")', '<b><a href="/abcd.html" data-email="example@email.com">abcd</a></b>'))
['example@email.com']
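For what it's worth, the pattern in the question fails for two reasons: \w does not match the hyphen in data-email, and nothing in it is specific to that attribute, so it also picks up href. The sketch below shows that, plus a slightly more tolerant variant that accepts single or double quotes and spaces around the equals sign. The sample markup and the name DATA_EMAIL are assumptions for the example, and a regex like this is still fragile compared with a real HTML parser.

```python
import re

# Assumed sample markup, not taken from the question
html_text = (
    '<a href="/a.html" data-email="example@email.com">A</a>'
    "<a href='/b.html' data-email = 'other-example@email.com'>B</a>"
)

# The pattern from the question matches any attribute and stops at the hyphen
original = re.findall(r'\w+\s*=\s*".*?"', html_text)
print(original)
# ['href="/a.html"', 'email="example@email.com"']

# Tolerant variant: group 1 remembers the opening quote, \1 requires the same closing quote
DATA_EMAIL = re.compile(r'''data-email\s*=\s*(["'])(.*?)\1''')
emails = [m.group(2) for m in DATA_EMAIL.finditer(html_text)]
print(emails)
# ['example@email.com', 'other-example@email.com']
```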

score 0 · Answer 3 · answered Apr 15 '16 at 08:43

You can get the email address with the pattern below. I'm not sure exactly what input you are dealing with, so it would help if you could post some examples, but this might work for you:

re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)")

Applied to your sample, this matches "example@email.com".
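One thing to watch out for when using this pattern: because it contains capturing groups, re.findall returns tuples rather than plain strings. A minimal sketch of the difference, using a non-capturing variant (the name EMAIL_RE and the sample text are assumptions for the example):

```python
import re

text = '<a href="#" data-email="example@email.com">Andrew S. Johnson</a>'

# With capturing groups, findall returns one tuple of groups per match
grouped = re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)").findall(text)
print(grouped)
# [('example@email.com', 'email.')]

# Making the inner group non-capturing and dropping the outer one gives plain strings
EMAIL_RE = re.compile(r"[\w\-.]+@(?:\w[\w\-]+\.)+[\w\-]+")
plain = EMAIL_RE.findall(text)
print(plain)
# ['example@email.com']
```

Note that this searches the whole text, so it would also pick up addresses that appear outside a data-email attribute.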

score 0 · Answer 4 · answered Apr 15 '16 at 09:20


BeautifulSoup is your friend:

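from bs4 import BeautifulSoup

# html_string is assumed to hold the page source; a one-line sample is used here
html_string = '<a href="#" data-email="example@email.com">Andrew S. Johnson</a>'

soup = BeautifulSoup(html_string, 'html5lib')  # needs the html5lib package installed
emails = []
for a in soup.find_all('a'):
    try:
        emails.append(a['data-email'])
    except KeyError:
        # this <a> tag has no data-email attribute
        continue

print(emails)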

answered by Kruger
