
My question is about searching through HTML with Python. I am using this code:

import urllib.request

with urllib.request.urlopen("http://") as url:
    data = url.read().decode()

Now this returns the whole HTML source of the page, and I want to extract all email addresses from it.

Can somebody lend me a hand here? Thanks in advance.

Patrick Artner
zorange
    Check out [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). – FamousJameous Feb 28 '18 at 21:49
  • It might be helpful if you can provide an example of the data that comes back, so we can help you figure out how to parse out the email addresses – SimaPro Feb 28 '18 at 21:50
  • what about [pyquery: a jquery-like library for python](https://pythonhosted.org/pyquery/)? – Sphinx Feb 28 '18 at 21:52
  • just do not think, mention, suggest or use regex or [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Patrick Artner Feb 28 '18 at 21:54
  • man i havent seen zalgo in a while ;P – Joran Beasley Feb 28 '18 at 22:02

2 Answers


Remember that you should not use regex to parse the HTML itself (thanks @Patrick Artner), but you can use Beautiful Soup to extract all visible text from a web page. Then you can search that text (which is just a string) for email addresses. Here is how you can do it:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

with urllib.request.urlopen("https://en.wikipedia.org/wiki/Email_address") as url:
    data = url.read().decode()
    text = text_from_html(data)
    print(re.findall(r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*", text))

The two helper functions grab all text that is visible on the page, and then the ridiculously long regex pulls every email address out of that text. I used Wikipedia's article on email addresses as an example, and here is the output:

['John.Smith@example.com', 'local-part@domain', 'jsmith@example.com', 'john.smith@example.org', 'local-part@domain', 'John..Doe@example.com', 'fred+bah@domain', 'fred+foo@domain', 'fred@domain', 'john.smith@example.com', 'john.smith@example.com', 'jsmith@example.com', 'JSmith@example.com', 'john.smith@example.com', 'john.smith@example.com', 'prettyandsimple@example.com', 'very.common@example.com', 'disposable.style.email.with+symbol@example.com', 'other.email-with-dash@example.com', 'fully-qualified-domain@example.com', 'user.name+tag+sorting@example.com', 'user.name@example.com', 'x@example.com', 'example-indeed@strange-example.com', 'admin@mailserver1', "#!$%&'*+-/=?^_`{}|~@example.org", 'example@s.solutions', 'user@localserver', 'A@b', 'c@example.com', 'l@example.com', 'right@example.com', 'allowed@example.com', 'allowed@example.com', '1234567890123456789012345678901234567890123456789012345678901234+x@example.com', 'john..doe@example.com', 'example@localhost', 'john.doe@example', 'joeuser+tag@example.com', 'joeuser@example.com', 'foo+bar@example.com', 'foobar@example.com']
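Note that `re.findall` returns matches in document order and can include duplicates, as the output above shows. If you only need each address once, a minimal follow-up sketch (the sample list here is shortened and invented for illustration):

```python
# Collapse duplicate matches while keeping first-seen order.
# `emails` stands in for the list returned by re.findall above.
emails = ['jsmith@example.com', 'a@example.com', 'jsmith@example.com']
unique_emails = list(dict.fromkeys(emails))  # dict keys preserve insertion order
print(unique_emails)  # ['jsmith@example.com', 'a@example.com']
```

Using `dict.fromkeys` instead of `set` keeps the original order of first appearance (guaranteed in Python 3.7+).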
user3483203
  • You responded so fast! And great answers. They work and I understand how! Just need to figure out some terms you used but thats what duckduckgo is for. Thanks for your effort all – zorange Feb 28 '18 at 22:12

Using BeautifulSoup and Requests you could do this:

import requests
from bs4 import BeautifulSoup
import re

response = requests.get("your_url")
response_text = response.text
beautiful_response = BeautifulSoup(response_text, 'html.parser')

email_regex = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

list_of_emails = re.findall(email_regex, beautiful_response.text)

# In Python 3 the matches are already str; encode only if you need bytes
list_of_emails_decoded = [email.encode('utf-8') for email in list_of_emails]
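To see what the regex itself does, it can be run on plain text without any network call (the sample string below is made up, not taken from a real page):

```python
import re

# Same pattern as in the answer above, applied to an invented sample string.
email_regex = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
sample = "Contact support@example.com or sales@example.org for help"
print(re.findall(email_regex, sample))  # ['support@example.com', 'sales@example.org']
```

This is a deliberately loose pattern: it accepts some strings that are not valid addresses, which is usually fine for scraping but worth knowing before relying on it.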
Nash