I am currently trying to write a program that goes through URLs and scrapes email addresses from each web page (using requests to get the HTML and a regex to find the emails).
However, one of the pages, "http://www.netronixinc.com/products.aspx", seems to send my regex into catastrophic backtracking (it takes forever to process).
for re_match in re.finditer(EMAIL_REGEX, data):
    print("this never gets printed")
I tried to set a timeout alarm for the call, as suggested in this answer: "https://stackoverflow.com/questions/492519/timeout-on-a-function-call", but it doesn't seem to work in this situation.
So, what I'm trying to find out is:
- Is there a way to avoid catastrophic backtracking in this scenario (finding email addresses)? (A sketch of the kind of rewrite I mean is right after this list.)
- Is there a way to set a timeout on a regex call in Python? (I'm on Ubuntu Linux; the workaround I'm considering is sketched after the complete code below.)
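To make the first question concrete, here is a rough sketch of the kind of rewrite I have in mind: writing the domain as one label plus one or more ".label" groups, so the repeated character class can no longer consume the literal dot that follows it. SAFER_EMAIL_REGEX is just my name for it, and I haven't verified that it matches the same set of addresses as my original pattern, or that it actually avoids the slowdown.

import re

# Sketch only: [A-Za-z0-9-]+ excludes the dot, so it can't overlap with the \. that follows it.
SAFER_EMAIL_REGEX = r"[A-Za-z0-9+._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

sample = "contact sales@example.co.uk or support@example.com for details"
print(re.findall(SAFER_EMAIL_REGEX, sample))
# ['sales@example.co.uk', 'support@example.com']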
Below is the complete code. Many thanks!
import requests
import signal
import time
import re

EMAIL_REGEX = r"""([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"""

class Timeout(Exception):
    pass

def handler(sig, frame):
    raise Timeout

signal.signal(signal.SIGALRM, handler)  # register interest in SIGALRM events

data = requests.get('http://www.netronixinc.com/products.aspx')
data = data.text

signal.alarm(2)  # ask for a SIGALRM after 2 seconds
for re_match in re.finditer(EMAIL_REGEX, data):
    print("hurryout")