
I am currently trying to write a program that goes through URLs and scrapes email addresses from each web page (using requests to get the HTML and a regex to find the emails).

But one of the pages, "http://www.netronixinc.com/products.aspx", seems to send my regex into catastrophic backtracking (it takes forever to process).

for re_match in re.finditer(EMAIL_REGEX, data):
    print("this never gets print")

I tried to set a timeout alarm for the function, as suggested on "https://stackoverflow.com/questions/492519/timeout-on-a-function-call", but it didn't seem to work in this situation.

So, what I'm trying to find out is:

  1. Is there a way to avoid catastrophic backtracking in this scenario (finding emails)? (See the sketch just after this list for the kind of rewrite I mean.)
  2. Is there a way to set a timeout on a regex call in Python? (Ubuntu Linux; a process-based sketch follows the complete code below.)
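
For question 1, this is the kind of rewrite I have in mind: a sketch that bounds the repetition, so a failed match can only back up a fixed number of characters instead of re-scanning megabytes. The 64/255 caps are the RFC 5321 length limits, and the {2,24} on the last part is my own guess:

import re

# Same character classes as my EMAIL_REGEX below, but with bounded
# repetition, so a miss backs up at most ~64/255 characters.
BOUNDED_EMAIL_REGEX = r"[a-zA-Z0-9+._-]{1,64}@[a-zA-Z0-9._-]{1,255}\.[a-zA-Z0-9_-]{2,24}"

text = "write to admin@example.org for details"
print(re.findall(BOUNDED_EMAIL_REGEX, text))  # ['admin@example.org']

Would that be the right way to go about it?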

Below is the complete code. Many thanks!

import requests
import signal
import re

EMAIL_REGEX = r"""([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)"""

class Timeout(Exception):
    pass

def handler(sig, frame):
    raise Timeout

signal.signal(signal.SIGALRM, handler)  # register interest in SIGALRM events


data = requests.get('http://www.netronixinc.com/products.aspx').text

signal.alarm(2)  # ask the OS for a SIGALRM in 2 seconds
try:
    for re_match in re.finditer(EMAIL_REGEX, data):
        print(re_match.group())  # never reached on this page
except Timeout:
    print("hurryout")  # expected after 2 seconds, but doesn't fire while finditer is stuck
finally:
    signal.alarm(0)  # cancel any pending alarm
  • You might want to make sure that after you find `.com` `.ca` `.co.uk` ... you aren't looking for any longer tails after that. You want to tell it that `.com` `.ca` ... is the end of the regex pattern. That might help – Sam Sep 08 '20 at 02:14
  • You seem to be downloading a large hex-encoded file with some html at the beginning and end. I don't think it contains what you think it does. – Frank Yellin Sep 08 '20 at 03:02
  • My solution was to limit the page size (of the requested page) to less than 1 million, so that I am not wasting resources on large hex-encoded files. Thanks @FrankYellin – 張劭維 Sep 09 '20 at 02:41
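
A sketch of what Sam's suggestion about anchoring the end of the pattern might look like; the \b and the simplified TLD class are guesses at what was meant:

import re

# Ending on a word boundary keeps the TLD from absorbing a longer
# tail of '-', '_' or digits (all allowed by the original pattern's
# final character class).
ANCHORED_EMAIL_REGEX = r"[a-zA-Z0-9+._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"

sample = "reach sales@example.com or support@example.co.uk today"
print(re.findall(ANCHORED_EMAIL_REGEX, sample))
# ['sales@example.com', 'support@example.co.uk']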

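A sketch of the size cap described in 張劭維's last comment; I'm assuming the "1 million" means bytes, and streaming the body (rather than trusting a Content-Length header the server may not send) is an assumption too:

import requests

MAX_BYTES = 1_000_000  # the "1 million" cap from the comment above

resp = requests.get('http://www.netronixinc.com/products.aspx',
                    stream=True, timeout=10)
chunks = []
total = 0
for chunk in resp.iter_content(chunk_size=8192):
    chunks.append(chunk)
    total += len(chunk)
    if total >= MAX_BYTES:
        break  # stop reading once the cap is reached
resp.close()
data = b"".join(chunks).decode(resp.encoding or "utf-8", errors="replace")
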
0 Answers