I have code that scrapes a million websites and detects contact info on their homepages.
For some reason, when I run the code it gets stuck and stops making progress after about 60k requests. I mark the website URLs in my DB as status=done.
I have run the code several times, and each time it gets stuck at around 60k requests.
It doesn't get stuck on one particular website.
Here are the regexes I am using:
emails = re.findall(r'[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
abns = re.findall(r'[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)
licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)
My suspicion is that the licences regex is causing the issue. How can I simplify it, and how can I remove the backtracking?
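One way to check that suspicion is to time each pattern on its own against a page body. The following is only a minimal sketch: `time_patterns` and the `sample` string are hypothetical names made up for illustration, and only two of the four patterns are included to keep it short. The sample imitates a page where "lic" (as in "public") is followed by a long run of spaces and no digits, which is the kind of input where the adjacent `\s*` repetitions in the licences pattern can be split in many different ways before the match finally fails.

```python
import re
import time

# Patterns copied from the question (raw strings, same behaviour).
PATTERNS = {
    "emails": re.compile(r"[\w\.-]+@[\w-]+\.[\w\.-]+"),
    "licences": re.compile(
        r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)",
        re.IGNORECASE,
    ),
}


def time_patterns(lc_body):
    """Return {pattern name: seconds spent in findall} for one page body."""
    timings = {}
    for name, pattern in PATTERNS.items():
        start = time.perf_counter()
        pattern.findall(lc_body)
        timings[name] = time.perf_counter() - start
    return timings


# Hypothetical input: "lic" followed by a run of spaces and no digits.
# The run is kept short on purpose; longer runs get slow very quickly.
sample = "public" + " " * 200 + "x"
for name, seconds in time_patterns(sample).items():
    print(f"{name}: {seconds:.4f}s")
```

Running this with progressively longer runs of spaces should show whether the time for the licences pattern grows much faster than the input length, which would point to backtracking rather than to the crawler or the DB.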
I want to find all possible licence numbers.
It can be License No: 2543, License: 2543, License # 2543, License #2543, License# 2543, and many other combinations as well.
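For reference, here is a sketch of a stricter pattern that covers the formats listed above. It is an assumption about what would be acceptable, not a drop-in replacement: `LICENCE_RE` and `find_licences` are hypothetical names, the word between the keyword and the number is limited to "No" or "Number" instead of any `\w*`, and whitespace runs are capped at three characters. Each whitespace run is separated from the next by a non-whitespace token, so the engine has very few ways to retry when no digits follow.

```python
import re

LICENCE_RE = re.compile(
    r"""
    \blic(?:en[cs]e)?\b              # lic, licence or license as a whole word
    \s{0,3}                          # at most three whitespace characters
    (?:(?:no|number)\b\.?\s{0,3})?   # optional No, No. or Number
    (?:[#:.\-]\s{0,3})?              # optional single separator
    (\d+)                            # the licence number itself
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_licences(text):
    """Return the list of licence numbers (as strings) found in text."""
    return LICENCE_RE.findall(text)


examples = [
    "License No: 2543",
    "License: 2543",
    "License # 2543",
    "License #2543",
    "License# 2543",
]
for example in examples:
    print(example, "->", find_licences(example))
```

The leading `\b` also stops the pattern from starting inside words such as "public" or "click", which the original unanchored `Lic` alternative would try on a lowercased body.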