
I am trying to find emails in HTML pages using a regex, but I have problems with some websites.

The main problem is that the regex call hangs the process and leaves the CPU overloaded.

import re
from urllib.request import urlopen, Request

email_regex = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = str(urlopen(request, timeout=5).read().decode("utf-8", "strict"))

email_regex.findall(html) ## here is where regex takes a long time

I have no problems if the website is a different one, for example:

request = Request('https://www.velezmalaga.es/')

If someone knows how to solve this problem, or how to apply a timeout to the regex call, I would appreciate it.

I use Windows.

2 Answers


I initially tried fiddling with your approach, but then I ditched it and resorted to BeautifulSoup. It worked.

Try this:

import re
import requests

from bs4 import BeautifulSoup


headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}

pages = ['http://www.serviciositvyecla.com', 'https://www.velezmalaga.es/']

emails_found = set()
for page in pages:
    html = requests.get(page, headers=headers).content
    soup = BeautifulSoup(html, "html.parser").select('a[href^=mailto]')
    for item in soup:
        try:
            emails_found.add(item['href'].split(":")[-1].strip())
        except KeyError:  # anchor without an 'href' attribute
            print("No email :(")

print('\n'.join(email for email in emails_found))

Output:

info@serviciositvyecla.com
oac@velezmalaga.es

EDIT:

One reason your approach doesn't work is, well, the regex itself. The other, I suspect, is the size of the HTML returned.

See this:

import re
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}

html = requests.get('https://www.velezmalaga.es/', headers=headers).text

op_regx = r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})'
simplified_regex = r'[\w\.-]+@[\w\.-]+\.\w+'

print(f"OP's regex results: {re.findall(op_regx, html)}")
print(f"Simplified regex results: {re.findall(simplified_regex, html)}")

This prints:

OP's regex results: []
Simplified regex results: ['oac@velezmalaga.es', 'oac@velezmalaga.es']
baduker
  • Hi baduker, thanks for your response. The edited solution gives a result for 'https://www.velezmalaga.es/' but not for 'http://www.serviciositvyecla.com'; it still hangs the process and leaves the CPU overloaded. – Joakin Montesinos Sep 30 '20 at 07:14
  • @JoakinMontesinos the first part of the answer gets *both* emails. I also wrote about possible reasons why your code might not work the way you expect. – baduker Sep 30 '20 at 07:57
  • Sure, the solution with bs4 works, but it is not what I am looking for, since you have applied the rule ```.select('a[href^=mailto]')```. The idea of using reguex on the HTML is not to find the tags. – Joakin Montesinos Sep 30 '20 at 10:07
  • @JoakinMontesinos first of all, it's `regex`, not `reguex`. The other thing is that parsing HTML with regex is, well, not possible, since it depends on matching the opening and the closing tag, which regexps cannot do. There are better tools for this, like BS4 or lxml. Read more here - https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – baduker Sep 30 '20 at 10:10
  • Oops, sorry for the typo, and great link. Despite reading it, I would like to continue using regex. Is there a workaround to put a timeout on the regex when it freezes the process and leaves the CPU overloaded, even if the result is then empty? – Joakin Montesinos Sep 30 '20 at 10:35

Finally, I found a solution that stops the regex search from consuming all the RAM. For my problem, getting an empty result even though there is an email on the page is acceptable, as long as the process is not blocked by running out of memory. The HTML of the scraped page contained 5.5 million characters, and 5.1 million of them carried no useful information: they belonged to a hidden div full of unintelligible characters. I added a guard along the lines of: if len(html) < 1000000: do whatever.
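
A minimal sketch of that guard, reusing the request code from the question (the 1,000,000-character threshold is just the value mentioned above, not a general rule):

import re
from urllib.request import urlopen, Request

email_regex = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0')  # any browser-like UA, as in the question
html = urlopen(request, timeout=5).read().decode("utf-8", "strict")

# Only run the regex on pages of a reasonable size; here an empty result is
# preferable to a findall() call that never returns.
if len(html) < 1000000:
    emails = email_regex.findall(html)
else:
    emails = []

print(emails)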