1

I am trying to extract email addresses from HTML source code with a regex. Each address is stored in an attribute of an <a> tag, like this: data-email="example@email.com"

I'm quite new to regex and came up with this: /\w+\s*=\s*".*?"/ but it doesn't seem to work. Getting my head around it all is difficult.

What could I do?

Appreciate any help.

KriiV

4 Answers

2

If your source code is HTML, wouldn't it be easier to use an HTML parser? You could use lxml, for example:

from lxml import etree

html = etree.HTML("""
<html>
    <head>
        <title>History of Roundish Stones in the Paleozoic Era</title>
    </head>
    <body>
        <a href="#" data-email="example@email.com">Andrew S. Johnson</a>
        <a href="#" data-email="other-example@email.com">E. Idle</a>
    </body>
</html>
""")

print(html.xpath('//@data-email'))

This prints:

['example@email.com', 'other-example@email.com']
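
If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration: the class name DataEmailCollector and the sample markup are made up for the example, and it collects data-email from any tag, not only <a>.

```python
from html.parser import HTMLParser


class DataEmailCollector(HTMLParser):
    """Collects data-email attribute values (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "data-email" and value:
                self.emails.append(value)


collector = DataEmailCollector()
collector.feed("""
<a href="#" data-email="example@email.com">Andrew S. Johnson</a>
<a href="#" data-email="other-example@email.com">E. Idle</a>
<a href="#">No address here</a>
""")
print(collector.emails)
```
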
Wander Nauta
0

If I get your question correctly, this is what you might need to extract email addresses:

>>> import re
>>> print(re.findall(r'(?<=data-email=")[^"]*(?=")', '<b><a href="/abcd.html" data-email="example@email.com">abcd</a></b>'))
['example@email.com']
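
Python's lookbehind has to be fixed-width, so the pattern above cannot allow optional whitespace around the equals sign. A possible variation, assuming the value is wrapped in matching single or double quotes, uses a capturing group instead (the names DATA_EMAIL_RE and sample are only for illustration):

```python
import re

# Assumption: the attribute value is wrapped in matching single or double quotes.
# Group 1 is the opening quote, group 2 is the attribute value.
DATA_EMAIL_RE = re.compile(r"""data-email\s*=\s*(["'])(.*?)\1""")

sample = '<a href="/abcd.html" data-email = "example@email.com">abcd</a>'
emails = [m.group(2) for m in DATA_EMAIL_RE.finditer(sample)]
print(emails)
```
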
riteshtch
0

You can get the email address with the pattern below. I'm not sure exactly what input you are dealing with, so it would help if you could post some examples as well. In the meantime, you can try this:

re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)")

This will match "example@email.com".
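
As a rough usage sketch (the sample text and variable names are assumptions, not from the question), the compiled pattern could be applied like this. Because the pattern contains two capturing groups, findall() would return tuples, so finditer() with group(0) is used here to get the whole address:

```python
import re

email_re = re.compile(r"([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)")

# Hypothetical sample input modeled on the attribute from the question.
text = '<a href="#" data-email="example@email.com">abcd</a>'

# group(0) is the full match, regardless of the capturing groups.
emails = [m.group(0) for m in email_re.finditer(text)]
print(emails)
```

Note that this matches anything shaped like an email address anywhere in the text, not only values of data-email.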

Abhi
0

BeautifulSoup is your friend:

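from bs4 import BeautifulSoup as BS

# Sample input; substitute your own HTML source here.
html_string = '<a href="#" data-email="example@email.com">Andrew S. Johnson</a>'

emails = []
# The 'html5lib' parser requires the separate html5lib package to be installed.
soup = BS(html_string, 'html5lib')
for a in soup.find_all('a'):
    try:
        emails.append(a['data-email'])
    except KeyError:
        # This <a> tag has no data-email attribute, so skip it.
        continue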
Kruger