
I recently became aware of Regular expression Denial of Service (ReDoS) attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given by OWASP and Wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.

A description of evil regexes, from Wikipedia:

  • the regular expression applies repetition ("+", "*") to a complex subexpression;
  • for the repeated subexpression, there exists a match which is also a suffix of another valid match.

With examples, again from Wikipedia:

  • (a+)+
  • ([a-zA-Z]+)*
  • (a|aa)+
  • (a|a?)+
  • (.*a){x} for x > 10

Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.

Mike Partridge
  • Another link about this topic: http://www.regular-expressions.info/catastrophic.html – Daniel Hilgarth Oct 11 '12 at 14:36
  • Here's a tool for performing static analysis on regular expressions to discover suspected ReDoS problems: http://www.cs.bham.ac.uk/~hxt/research/rxxr.shtml – tripleee Jan 21 '19 at 13:52
  • The page provided by @tripleee appears to have a broken link to the RXXR tool. Here's a GitHub mirror: https://github.com/ConradIrwin/rxxr2/ – Mike Hill Feb 26 '19 at 18:19
  • Additionally, for those curious, it looks like the authors of the original RXXR tool superseded it with RXXR2. Their new page currently has a working link to the RXXR2 source: http://www.cs.bham.ac.uk/~hxt/research/rxxr2/ – Mike Hill Feb 26 '19 at 18:25

8 Answers


Why Are Evil Regexes A Problem?

Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.

Here is a simple pattern inspired by the first example in the OP's post:

^((ab)*)+$

Given the input:

abababababababababababab

The regex engine tries something like (abababababababababababab) and a match is found on the first try.

But then we throw the monkey wrench in:

abababababababababababab a

The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:

(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
                              ...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope

The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
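You can watch this happen yourself. Here is a minimal C# sketch (the pattern and input come from the example above; the loop bounds are arbitrary, exact timings depend on your machine, and a non-backtracking engine won't blow up at all):

using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class CatastrophicDemo
{
    static void Main()
    {
        var evil = new Regex("^((ab)*)+$");

        for (int n = 12; n <= 22; n++)
        {
            // n repetitions of "ab" plus a stray "a": almost, but not quite, a match
            string input = string.Concat(Enumerable.Repeat("ab", n)) + "a";

            var stopwatch = Stopwatch.StartNew();
            bool isMatch = evil.IsMatch(input);
            stopwatch.Stop();

            // On a backtracking engine, expect the time to roughly double
            // with each extra repetition; abort when you lose patience
            Console.WriteLine($"n={n,2}  match={isMatch}  {stopwatch.ElapsedMilliseconds} ms");
        }
    }
}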

How to Spot Evil Regexes

It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem, which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them.

Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic groups only discard backtracking positions when the engine exits the group, so they don't prevent backtracking that happens while the engine is still inside the group: ^(?>((ab)*)+$) is still dangerous, because the $ sits inside the group and all of the catastrophic backtracking takes place before the group gives up. By contrast, ^(?>(ab)*)+$ is safe - it'll match (abababababababababababab) and then refuse to give up any of its matched characters, thus preventing catastrophic backtracking.
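Here's a small demonstration of that difference in C# (a sketch: .NET supports atomic groups, and the 30-repetition input and 5-second timeout are arbitrary choices):

using System;
using System.Linq;
using System.Text.RegularExpressions;

class AtomicDemo
{
    static void Main()
    {
        // 30 repetitions of "ab" plus a stray "a", so the overall match must fail
        string almost = string.Concat(Enumerable.Repeat("ab", 30)) + "a";

        // The atomic group wraps the whole (ab)* loop: once it has matched as much
        // as it can, it gives nothing back, so the failure at "$" is reported fast
        var safe = new Regex(@"^(?>(ab)*)+$", RegexOptions.None, TimeSpan.FromSeconds(5));
        Console.WriteLine(safe.IsMatch(almost));            // False, almost instantly

        // With the "$" inside the atomic group, the failure happens while the
        // engine is still inside the group, so the backtracking still explodes
        var evil = new Regex(@"^(?>((ab)*)+$)", RegexOptions.None, TimeSpan.FromSeconds(5));
        try { Console.WriteLine(evil.IsMatch(almost)); }
        catch (RegexMatchTimeoutException) { Console.WriteLine("timed out"); }
    }
}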

Unfortunately, once it's written, it's very hard to spot a problem regex quickly. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.


Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing a tool that performs static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools for analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.

Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin

khelwood
JDB
  • This is a very good answer in describing /why/ the example regex takes a long time, but I'm looking for a few rules that a person can internalize to help recognize a problem regex. – Mike Partridge Oct 12 '12 at 15:53
  • Knowing "why" is the most important step toward avoiding writing an "evil" regex. Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. If you want a blanket fix, atomic grouping is usually the best way round, but that can have a significant impact on the patterns the regex will match. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of experience, a lot of time and/or a single catastrophic event. – JDB Oct 12 '12 at 19:31
  • This is why my preference is for regex engines that do not support backtracking without the user forcing it, i.e. lex/flex. – Spencer Rathbun May 29 '13 at 21:20
  • @MikePartridge it's the classic computability problem: deciding whether some code will loop forever or stop. With regexes you can probably catch some of the bad ones by searching for certain patterns/rules, but short of heavy analysis you will never catch them all. Some options: 1) never let users supply regexes to your server. 2) configure the regex engine to terminate calculation early enough (but test that the valid regexes in your code still work, even with stringent limits). 3) run regex code in a low-priority thread with CPU/memory limits. – Ped7g May 09 '17 at 15:42
  • @MikePartridge - recently came across a paper about some new tools that are being developed to detect these problematic regexes statically. Interesting stuff... I think it'll be worth following. – JDB Aug 20 '18 at 17:42
  • Regarding @JDB's last point: a Google search I did in parallel yielded the following JavaScript package: https://github.com/Wisdom/RegEx-DoS, which is itself based on https://github.com/substack/safe-regex. I haven't tried it yet myself, but it seems like a promising starting point to at least provide some static coverage for this issue. – Moshe Jonathan Gordon Radian Jun 25 '19 at 15:13

Detecting evil regexes

  1. Try Nicolaas Weideman's RegexStaticAnalysis project.
  2. Try my ensemble-style vuln-regex-detector which has a CLI for Weideman's tool and others.

Rules of thumb

Evil regexes are always due to ambiguity in the corresponding NFA, which you can visualize with tools like regexper.

Here are some forms of ambiguity. Don't use these in your regexes (a naive detector for the first form is sketched after this list).

  1. Nesting quantifiers like (a+)+ (aka "star height > 1"). This can cause exponential blow-up. See substack's safe-regex tool.
  2. Quantified Overlapping Disjunctions like (a|a)+. This can cause exponential blow-up.
  3. Quantified Overlapping Adjacencies like \d+\d+. This can cause polynomial blow-up.
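Here's a naive C# sketch of the kind of check safe-regex performs for the first form: flag any quantified group whose body itself contains a quantifier. It's an illustration only (it treats every parenthesis as a group and ignores character classes); forms 2 and 3 genuinely need NFA ambiguity analysis, which is what the tools above do.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StarHeight
{
    // Flags a pattern when a quantified group itself contains a quantifier,
    // e.g. (a+)+ or ([a-zA-Z]+)*. Naive: a '+' inside [...] causes false positives.
    public static bool HasNestedQuantifier(string pattern)
    {
        var groupStarts = new Stack<int>();
        for (int i = 0; i < pattern.Length; i++)
        {
            char c = pattern[i];
            if (c == '\\') { i++; continue; }               // skip escaped character
            if (c == '(') { groupStarts.Push(i); continue; }
            if (c == ')' && groupStarts.Count > 0)
            {
                int start = groupStarts.Pop();
                bool groupIsQuantified =
                    i + 1 < pattern.Length && "*+?{".IndexOf(pattern[i + 1]) >= 0;
                string body = pattern.Substring(start + 1, i - start - 1);
                bool bodyHasQuantifier = Regex.IsMatch(body, @"(?<!\\)[*+?{]");
                if (groupIsQuantified && bodyHasQuantifier) return true;
            }
        }
        return false;
    }
}

// HasNestedQuantifier("(a+)+")        -> true  (exponential risk)
// HasNestedQuantifier("([a-zA-Z]+)*") -> true
// HasNestedQuantifier("(ab)*")        -> false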

Additional resources

I wrote this paper on super-linear regexes. It includes loads of references to other regex-related research.

James Davis

What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:

  1. Alternatives must be mutually exclusive. If multiple alternatives can match the same text, then the engine will try each of them when the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex, then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive, or even better, to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.

  2. Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched, the first .*? will expand character by character until the end of the line or file. For each expansion, the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters, a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance. (A runnable version of this fix is sketched after this list.)

  3. A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.
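Here is a runnable C# sketch of the fix from point 2. One caveat: the .NET engine has no possessive quantifiers (a[^b]*+b[^c]*+c won't parse), so the sketch uses atomic groups, which express the same "don't give it back" idea; the input shape and the 2-second timeout are arbitrary.

using System;
using System.Linq;
using System.Text.RegularExpressions;

class SequenceFixDemo
{
    static void Main()
    {
        // One "a", many "b"s, and no "c" anywhere, so every match attempt fails
        string input = "a" + string.Concat(Enumerable.Repeat("bx", 25000));

        // Problematic: when "c" never matches, every split of the input between
        // the two .*? tokens is tried, which is quadratic in the input length
        var slow = new Regex("a.*?b.*?c", RegexOptions.None, TimeSpan.FromSeconds(2));

        // The fix: each run stops at its delimiter, and the atomic groups
        // refuse to hand characters back once matched
        var fast = new Regex("a(?>[^b]*)b(?>[^c]*)c", RegexOptions.None, TimeSpan.FromSeconds(2));

        Console.WriteLine(fast.IsMatch(input));             // False, fails fast
        try { Console.WriteLine(slow.IsMatch(input)); }
        catch (RegexMatchTimeoutException) { Console.WriteLine("slow pattern timed out"); }
    }
}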

While I was writing my answer I decided that this merited a full article on my website. This is now online too.

Jan Goyvaerts

I would sum it up as "A repetition of a repetition". The first example you listed is a good one, as it states "the letter a, one or more times in a row. This can again happen one or more times in a row".

What to look for in this case is a combination of quantifiers, such as * and +.

A somewhat more subtle thing to look out for is the third and fourth examples. Those contain an OR operation in which both sides can match the same text. Combined with a quantifier on the expression, this can result in a LOT of potential match paths, depending on the input string.
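For instance (a sketch for a backtracking engine; the timings are only indicative), a failing match against (a|aa)+ has to try every way of carving the run of a's into ones and twos, so the work grows roughly like the Fibonacci sequence:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class OverlapDemo
{
    static void Main()
    {
        // Both alternatives can match the same text, so a failing match must
        // try every way of splitting the run of a's between them
        var rx = new Regex("^(a|aa)+$", RegexOptions.None, TimeSpan.FromSeconds(2));

        for (int n = 20; n <= 35; n += 5)
        {
            string input = new string('a', n) + "!";        // the "!" forces failure
            var stopwatch = Stopwatch.StartNew();
            try { rx.IsMatch(input); }
            catch (RegexMatchTimeoutException) { Console.Write("(timed out) "); }
            Console.WriteLine($"n={n}: {stopwatch.ElapsedMilliseconds} ms");
        }
    }
}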

To sum it up, TLDR-style:

Be careful how quantifiers are used in combination with other operators.

Jarmund
  • Currently, this answer comes closest to what I'm looking for: a rule of thumb for recognizing a regex that could cause catastrophic backtracking. – Mike Partridge Oct 12 '12 at 11:52
  • What you left out, and what seems to be an important part of the problem, is capturing groups. – Mike Partridge Oct 12 '12 at 11:57
  • @MikePartridge That too. I tried to boil it down as much as possible, so there are other things that can cause the same behavior, such as capturing groups. – Jarmund Oct 12 '12 at 15:49

Surprisingly, I have come across ReDoS quite a few times while performing source code reviews. One thing I would recommend is to use a timeout with whatever regular expression engine you are using.

For example, in C# I can create the regular expression with a TimeSpan match timeout.

using System;
using System.Text.RegularExpressions;

// Sample input; in real code this would be untrusted user input.
string description = "<div class=\"x\">some description</div>";

// Vulnerable pattern: ([^<]+)* nests one quantifier inside another.
string pattern = @"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$";

// The third constructor argument is the match timeout.
Regex regexTags = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1.0));
try
{
    string noTags = regexTags.Replace(description, "");
    Console.WriteLine(noTags);
}
catch (RegexMatchTimeoutException)
{
    Console.WriteLine("RegEx match timeout");
}

This regex is vulnerable to denial of service: without the timeout it will spin and eat resources. With the timeout, it throws a RegexMatchTimeoutException after the given timeout instead of consuming resources until a denial-of-service condition arises.

You will want to experiment with the timeout value to make sure it works for your usage.
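If you don't want to touch every Regex construction, .NET can also apply a default timeout process-wide: the Regex class reads it from the current AppDomain when no explicit timeout is passed.

// Set once at application startup. Any Regex created afterwards without an
// explicit matchTimeout argument uses this value instead of running unbounded;
// instances that pass their own timeout are unaffected.
AppDomain.CurrentDomain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT",
                                TimeSpan.FromSeconds(2));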

Casey

I would say this is related to the regex engine in use. You may not always be able to avoid these types of regexes, but if your regex engine is built right, then it is less of a problem. See this blog series for a great deal of information on the topic of regex engines.

Note the caveat at the bottom of the article: backtracking is an NP-complete problem. There is currently no way to process such expressions efficiently, and you might want to disallow them in your input.

Spencer Rathbun
  • `a*a*` doesn't use backreferences. Now, the regex engine uses *backtracking*, which is, perhaps, what you meant? In which case, all modern engines use backtracking. You can easily disable backtracking via `(?>...)`, but that will more often than not change the meaning of your expression (and in some cases it can be circumvented). – JDB May 29 '13 at 21:04
  • @Cyborgx37 whoops! I did mean backtracking. Fixed. – Spencer Rathbun May 29 '13 at 21:17
  • In that case, the engine either uses backtracking or it doesn't. There is virtually no way to restrict backtracking by restricting the input. – JDB May 29 '13 at 21:19
  • @JDB: "all modern engines use backtracking." - Maybe that was true in 2013, but [not anymore](https://github.com/google/re2/wiki/WhyRE2). – Kevin Jun 13 '19 at 20:58
  • @Kevin - sure. you win. – JDB Jun 13 '19 at 21:41

I don't think you can recognize such regexes, at least not all of them, or not without restrictively limiting their expressiveness. If you really care about ReDoS, I'd try to sandbox the matching and kill it with a timeout. There may also be regex implementations that let you cap the amount of backtracking.
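In C#, a sketch of the sandbox idea might look like the wrapper below (a hypothetical helper, not a library API). Note the caveat in the comments: an engine-level limit, like the matchTimeout parameter shown in another answer here, is strictly better where available.

using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class SandboxedRegex
{
    // Runs the match on a worker thread and stops waiting after `timeout`,
    // returning null on timeout. Caveat: the worker keeps burning CPU in the
    // background, since .NET offers no safe way to kill it; this protects the
    // caller's latency but not the server's resources.
    public static bool? TryIsMatch(Regex regex, string input, TimeSpan timeout)
    {
        var task = Task.Run(() => regex.IsMatch(input));
        return task.Wait(timeout) ? task.Result : (bool?)null;
    }
}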

Bergi
  • I think you're misunderstanding the question. As I read it, the OP is literally asking how *he* can recognize an evil regex, not how he can write a program to do so. Like, "I've written this regex, but how can I tell if it could be evil?" – ruakh Oct 11 '12 at 15:14
  • Uh, you could be right. I then can only recommend the article about catastrophic backtracking which @DanielHilgarth already linked in the comments. – Bergi Oct 11 '12 at 15:23
  • @ruakh each regex with `*` in it is vulnerable, isn't it? – 0x90 May 29 '13 at 20:43
  • @0x90: I wouldn't say that, no. – ruakh May 29 '13 at 20:48
  • @0x90: Because I don't consider e.g. `a*` or `\*` to be "vulnerable". – ruakh May 29 '13 at 20:57
  • @0x90 `a*` is not vulnerable at all. Meanwhile, `a{0,1000}a{0,1000}` is a catastrophic regex waiting to happen. Even `a?a?` can have nasty results under the right conditions. – JDB May 29 '13 at 21:00
  • I meant any regex with at least two literals and `*` or `+` in it. I am trying to work out whether there is any family of state machines which is inherently dangerous. I think you can identify regexes that may be vulnerable according to the number of states in the state machine / transition function. – 0x90 May 29 '13 at 21:02
  • @0x90 - Catastrophic backtracking is a danger whenever you have two expressions where one is identical to or a subset of the other, where the length of the expression is variable and where they are positioned such that one could give up one or more characters to the other via backtracking. For example, `a*b*c*$` is safe, but `a*b*[ac]*$` is dangerous, because `a*` could potentially give up characters to `[ac]*` if `b` is absent and the initial match fails (e.g. `aaaaaaaaaaaccccccccccd`). – JDB May 29 '13 at 21:11

There are a few ways I can think of to implement simplification rules: running small test inputs through pieces of the regex, or analyzing the regex's structure.

  • (a+)+ can be reduced to just (a+) using some sort of rule for replacing redundant operators
  • ([a-zA-Z]+)* could also be simplified to ([a-zA-Z]*) with our new redundancy-combining rule

The computer could run tests by running the small subexpressions of the regex against randomly-generated sequences of the relevant characters, and seeing what groups they all end up in. For the first one, the computer is like, hey, the regex wants a's, so let's try it with 6aaaxaaq. It then sees that each run of a's ends up in a single group, and concludes that no matter how many a's it puts in, it won't matter, since + already grabs them all into the group. For the second one, it's like, hey, the regex wants a bunch of letters, so let's try it with -fg0uj=; it again sees that each bunch ends up in one group, so it gets rid of the redundant repetition at the end.

Now we need a new rule to handle the next ones: The eliminate-irrelevant-options rule.

  • With (a|aa)+, the computer takes a look at it and is like, we like that big second one, but we can use that first one to fill in more gaps; let's get as many aa's as we can, and see if we can get anything else after we're done. It could run it against another test string, like eaaa@a~aa., to determine that.

  • You can protect yourself from (a|a?)+ by having the computer realize that the strings matched by a? are not the droids we are looking for. Since a? can always match the empty string anywhere, we decide that we don't like things like (a?)+, and throw it out.

  • We protect from (.*a){x} by getting it to realize that the characters matched by a would have already been grabbed by .*. We then throw out that part and use another rule to replace the redundant quantifiers in (.*){x}.

While implementing a system like this would be very complicated, this is a complicated problem, and a complicated solution may be necessary. You should also use techniques other people have brought up, like only allowing the regex some limited amount of execution resources before killing it if it doesn't finish.
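As a flavor of what the redundancy-combining rule might look like, here is a toy, purely textual rewrite in C#. It's a sketch under strong assumptions - the group body must contain no nested parentheses, and a + inside a character class would fool it - and the rules are only sound because (X+)+ and X+, and (X+)* and X*, match exactly the same strings:

using System.Text.RegularExpressions;

static class NaiveSimplifier
{
    // Collapse (X+)+ to (X+) and (X+)* to (X*) for a parenthesis-free body X.
    // A real tool would work on the parsed regex AST, not on the raw text.
    public static string Simplify(string pattern)
    {
        pattern = Regex.Replace(pattern, @"\((?<body>[^()]*)\+\)\+", "(${body}+)");
        pattern = Regex.Replace(pattern, @"\((?<body>[^()]*)\+\)\*", "(${body}*)");
        return pattern;
    }
}

// Simplify("(a+)+")        -> "(a+)"
// Simplify("([a-zA-Z]+)*") -> "([a-zA-Z]*)"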

AJMansfield
  • "Being like", recognizing what something "wants", "trying" guesses, "seeing" and drawing conclusions ("realizing", "determining") are non-trivial problems that are hard to implement algorithmically for computers… And testing examples is nothing to rely on; you would rather need some kind of proof. – Bergi May 29 '13 at 22:30
  • @Bergi What I meant by the test examples was that you take a tiny chunk of a complete regex and run it against a test string, as a simple way to determine how it behaves. Of course, you are only testing chunks that you have examined and already know don't do weird stuff in the test cases. – AJMansfield May 30 '13 at 00:49