How can I programmatically identify evil regexes?

Question

Is there an algorithm to determine whether a given JavaScript regex is vulnerable to ReDoS? The algorithm doesn't have to be perfect - some false positives and false negatives are acceptable. (I'm specifically interested in ECMA-262 regexes.)

A more pragmatic solution may be to stop the evaluation of the expression after a specific timeout, like 1 second. Maybe sample it a few times to ensure it wasn't just a busy host machine. That greatly depends on your runtime environment though. — deceze, Dec 02 '15 at 12:20
I want to programmatically inform developers that their regex may be evil before the regex gets used in production, so a timeout isn't an option. — fblundun, Dec 02 '15 at 12:48
Translate the regex into a NFA, and then try to convert that into a DFA, bailing out at some subjective complexity limit (automaton size). — Bergi, Dec 02 '15 at 17:17
@Bergi that's an interesting idea... But won't `^a*$`, `^a*a*$`, `^(a*a*)*$`, and `^(a|a)*$` all translate to the same DFA, even though only the first two are benign? — fblundun, Dec 03 '15 at 09:09
@fblundun: Oh right, that's not enough, as the DFA will be normalised. I guess that step is the interesting one: when you're canonicalising the NFA and loose states in that process, then something is wrong with the regex. Though if the DFA is growing exponentially, there also is a problem. — Bergi, Dec 03 '15 at 15:39
@deceze how do you limit/stop regex execution in javascript? — omer, Jun 30 '19 at 08:19

score 2 · Answer 1 · edited May 23 '17 at 12:32

It is hard to verify whether a regexp is evil or not without actually running it. You could try detecting some of the patterns detailed in the Wiki and generalise them:

e.g. For

(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10

You could check for )+ or )* or ){ sequences and validate against them. However, I guarantee that an attacker will find their way round them.

In essence it is a minefield to allow user set regexps. However, if you can timeout the regexp search, terminate the thread and then mark that regexp as "bad" you can mitigate the threat somewhat. In the case that the regexp is used later, maybe you could validate it by running it against an expected input at point of entry?

Later you will still need to be able to terminate it if the text evaluated at the later stage has a different effect with your regexp and mark it as bad so it will not be used again without user intervention.

I don't think the halting problem applies to regular expressions. Every formal regular expression halts on every input. As far as I know this is also true of JavaScript's extended regular expressions (see [this question](http://stackoverflow.com/questions/1241215/do-all-regular-expressions-halt)). — fblundun, Dec 02 '15 at 16:32

score 0 · Answer 2 · answered Aug 05 '16 at 10:08

TL;DR sort of, but not fully

In [9]: re.compile("(a+)+", re.DEBUG)
max_repeat 1 4294967295
  subpattern 1
    max_repeat 1 4294967295
      literal 97

Note those nested repeat 1..N, for large N, that's bad.

This takes care of all Wikipedia examples, except (a|aa)+and a*b?a*x.

Likewise it's hard to account for back-references, if your engine supports those.

IMO evil regexp is combination of two factors: combinatorial explosion and oversight in engine implementation. Thus, worst case also depends on regexp engine and sometimes flags. Backtracking is not always easy to identify.

Simple cases, however, can be identified.

How can I programmatically identify evil regexes?

2 Answers2