See Runaway Regular Expressions: Catastrophic Backtracking.
In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.
Constructs like (x+)+
and x+x+
practically guarantee this behaviour.
To detect and fix the problematic constructs, the following concept can be used:
At conceptual level, the presence of a problematic construct means that your regex is ambiguous - i.e. if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.
One way to do this is to
- always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
- define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)