Prevent Catastrophic Backtracking in Regex

Question

I have a code to scrape a million websites and detect contact info from their homepage.

For some reasons, when I run code, it gets stuck and does not proceed after crawling about 60k requests, I am marking the website URLs in my DB as status=done

I have run code several times but it gets stuck around 60k requests.

It doesnt get stuck on a certain website.

Here is Regex I am using

    emails = re.findall('[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
    mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
    abns = re.findall('[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)

    licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)

My thought is licences's regex is causing issues, how can I simplify it? How can I remove Backtracking ?

I want to find all Licence numbers possible.

It can be License No: 2543 , License: 2543, License # 2543, License #2543, License# 2543 and many other combinations as well.

The best practice is to use alternatives in an alternation group that do not match at the same location. Please provide a string that causes issues. — Wiktor Stribiżew, Apr 26 '17 at 18:49
`(Licence|Lic|License|Licence)` => `(Licence|Lic|License)` for starters :) — Jean-François Fabre, Apr 26 '17 at 18:50
@Jean-FrançoisFabre: No, `Lic(?:en[cs]e)?`. The issue is caused with the third group though, `(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)` - all alternatives start with `\s*`. What is it supposed to match? — Wiktor Stribiżew, Apr 26 '17 at 18:50
also: wouldn't it be better to compile the regexes and use the compiled version? Would certainly save some time. — Jean-François Fabre, Apr 26 '17 at 18:51
@WiktorStribiżew I want to extract Licence numbers ... `License No: 2543` , `License: 2543`, `License # 2543`, `License #2543`, `License# 2543` and there can be many other combinations as well ... — Umair Ayub, Apr 26 '17 at 18:58
Aha, so the `.` is a literal one? It must be escaped then. See [`Lic(?:en[cs]e)?(?:\s+\w+:?)?(?:\s*[#.:-])?\s*\d+`](https://regex101.com/r/rhKePl/1) — Wiktor Stribiżew, Apr 26 '17 at 19:02
@WiktorStribiżew can you please post a detailed answer on that? And any other suggestions you have in mind — Umair Ayub, Apr 26 '17 at 19:17

score 2 · Accepted Answer · edited Sep 24 '17 at 14:49

The issue is caused with the third group: (\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+) - all alternatives start with \s* here. This causes lots of redundant backtracking as these alternatives can match at the same location in a string. The best practice is to use alternatives in an alternation group that do not match at the same location.

Now, looking at the strings you need to match, I suggest using

Lic(?:en[cs]e)?(?:\W*No:)?\W*\d+

See the regex demo

Make the pattern more specific and linear, get rid of as many alternations as possible, use optional non-capturing groups and character classes.

Details:

Lic(?:en[cs]e)? - Lic followed with 1 or 0 occurrences (the (?:...)? is an optional non-capturing group since ? quantifier matches 1 or 0 occurrences of the quantified subpatterns) of ence or ense (the character class [sc] matches either s or c and is much more efficient than (s|c))
(?:\W*No:)? - a non-capturing group that matches 1 or 0 occurrences of 0+ non-word chars (with \W*) followed with No: substring
\W*
\d+ - 1 or more digits.

Prevent Catastrophic Backtracking in Regex

1 Answers1