
How do I improve this code? SonarQube is highlighting a regex pattern that could become really slow and produce a denial of service. Here's the code:

  // Single quotes
  // Double quotes
  // Control characters (x01 - x1f)
  private static final String SQL_INJECTION_PATTERN = ".*['\"\\x01-\\x1f].*";

  if (myValue.matches(SQL_INJECTION_PATTERN)) { // Here's the problem!
    ...
  }

I need to find if any of those chars is present, to take an action.

Joe DiNottra
  • You seem to be using `.*` because `match` checks for complete matches. With `Matcher.find()` the `.* ` can be left out. That should fix the sonar warning, but does not fix the security problem you were trying to solve in the first place. To prevent SQL injections, use so called *bind variables* or *prepared statements*. – Socowi Apr 13 '23 at 19:30

2 Answers


Just ignore the warning. Few regexes actually have potentially exponential performance characteristics. The warning only matters if you let your users choose the regex (imagine a search feature where the user submits a regex, or text that you splice into a regex, in a search-for-this form field), or if you apply such a regex that you wrote yourself to user input (trivially fixed by not having any such regexes in your code base). Neither is the case here, so...

it doesn't apply to you.

But you do have a security leak here

You're engaging in an act known as denylisting. An older term for this is 'blacklisting'. You list things that you know are bad, and scan for these. Any input that contains bad things is disallowed. The rest is allowed.

Effectively you're saying: Unless I can show conclusively it is bad, we assume it is good.

This is a broken approach to security.
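To see the gap concretely, here is a small self-contained check using the question's own denylist pattern. The classic numeric injection `1 OR 1=1` contains no quotes and no control characters, so the denylist waves it straight through:

```java
import java.util.regex.Pattern;

public class DenylistGap {
    // The denylist pattern from the question.
    static final Pattern DENY = Pattern.compile(".*['\"\\x01-\\x1f].*");

    public static void main(String[] args) {
        // No quotes, no control characters: the denylist does not catch it,
        // yet spliced into "WHERE id = " it changes the query's meaning.
        String payload = "1 OR 1=1";
        System.out.println(DENY.matcher(payload).matches()); // false: not caught
    }
}
```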

Allowlisting is the correct approach: Unless you can prove the input is safe, do not run it.

Specifically, for SQL injection pattern, the solution is to let the SQL engine take care of it: Make PreparedStatements, and use .setX to set them - or use a library like JDBI or JOOQ that does this for you. Then there is simply no reason to worry about SQL injection whatsoever.
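As a minimal sketch of that advice (the table and column names here are hypothetical, purely for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UserLookup {
    // Hypothetical query; "users"/"email"/"name" are made-up names.
    static String queryTemplate() {
        return "SELECT email FROM users WHERE name = ?";
    }

    // The user-supplied value is bound as a parameter, never concatenated
    // into the SQL text, so the driver handles quoting and injection is moot.
    static String findEmail(Connection conn, String userSuppliedName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(queryTemplate())) {
            ps.setString(1, userSuppliedName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```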

Note that the actual security leak you do have is of the 'the hacker will p0wn the entire box by gaining admin creds on it' variety. Compared to 'users might be able to DOS your box', it's.. a few million times worse.


EDIT: Here is an example of an allowlist approach in the same vein as the code you have (and with the same caveat: your security scanner may complain about a potential DoS attack vector here too; you can ignore it, or see the later paragraph):


private static final String SQL_SAFE_PATTERN = "[a-zA-Z0-9_ -]*";

  if (!myValue.matches(SQL_SAFE_PATTERN)) throw new IllegalArgumentException();

This code will check if the input consists solely of a very specific limited range of alphanums (just the ascii chars), and dash, and underscore, and space, and disallows everything that doesn't consist just of those things.

This may still be unsafe - it depends on what you do with it. SQL has keywords, for example; whether certain keywords are dangerous here, who knows - you'd need to elaborate considerably on what you're doing with user input.

The central point stands: It is highly likely your entire approach is wrong and you need to be using PreparedStatements.

One way to get rid of the warning is to stop using regexes. You can do this kind of scan just as easily with a for loop (and it is faster too, if anything):

for (int i = 0; i < sql.length(); i++) {
  char c = sql.charAt(i);
  if ((c < 'a' || c > 'z') && (c < 'A' || c > 'Z') && (c < '0' || c > '9') && c != ' ' && c != '_' && c != '-') throw new IllegalArgumentException();
}

But, as a general rule, rewriting code solely because you're trying to work around an overly aggressive linting tool is an incredibly perniciously dangerous thing to do to a codebase. Programming is hard enough as is.

rzwitserloot
  • Thank you for the answer. Yes, I agree with the vulnerability. I'll see what can I do to limit to safe patterns instead, if possible. – Joe DiNottra Apr 13 '23 at 19:24
  • For reference see the OWASP page on SQL Injection. Deny lists are not even an option nor alternative: https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html – aled Apr 13 '23 at 19:26
  • I added an example of what allowlisting looks like, reiterating _again_ that (going by very little detail here) it is __very__ unlikely trying to scan the SQL in the first place (whether by allowlisting or denylisting) is a good idea and instead, use PreparedStatement. – rzwitserloot Apr 13 '23 at 19:30
  • @rzwitserloot "...rewriting code solely because you're trying to work around an overly aggressive linting tool is an incredibly perniciously dangerous thing to do to a codebase..." -- Quite true. I'm new to SonarQube, and I was just wondering about that. Probably 25% of its findings are correct; the other 75% are more like warnings (some of them useful) but many don't really apply and I should avoid taking action on them. – Joe DiNottra Apr 13 '23 at 19:33
  • Anytime you use any linting tool you have to ensure there is a clear process on how to ignore it, and it can't be too onerous. Generally they have ways to mark lines off as 'person X reviewed the potential vuln marked by scanner Y and determined it did not apply', it has to be that easy (just a specially formatted comment, or a separate file, but separate files run into the issue of having a hard time tracking along with source file changes - they can't (easily) use line numbers, for example). – rzwitserloot Apr 13 '23 at 19:35
  • Some linting tools make this hard or impossible. In my professional opinion such linting tools are completely unusable. – rzwitserloot Apr 13 '23 at 19:35
  • I think you can change the first `.*` to a negated character class `[^a-zA-Z0-9_ -]*` to not over match it like `^[^a-zA-Z0-9_ -]*[a-zA-Z0-9_ -].*` – The fourth bird Apr 13 '23 at 19:45

The "vulnerability" (such as it is) is not one of those chars, but the leading .*. tldr: you could change it to .*?

Consider an input string with none of the suspect characters, e.g. xxxxxx. In trying to match this string against .*['"\x01-\x1f].*, the regex engine will attempt to match the leading .* greedily, so it tries:


xxxxxx (fails to match because there is no ['"\x01-\x1f] past the end of the input)
xxxxx  (fails to match because the next character is x which is not ['"\x01-\x1f])
xxxx   (ditto)
xxx    (ditto)
xx     (ditto)
x      (ditto)
       (ditto)
  -> no match found

The result of the greedy matching up front is that the engine must backtrack repeatedly, so determining that there is no match takes superlinear time in the length of the input. The regex analyser is likely being very conservative when it comes to evaluating the "risk".

The easy workaround is to use a non-greedy quantifier on the leading `.*`, i.e. `.*?['"\x01-\x1f].*`. The regex engine then does not backtrack over the prefix, so a non-matching input is rejected in linear time.
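A small self-contained check showing the lazy-prefix variant in action (the pattern string is taken from the question, with the suggested `.*?` prefix):

```java
import java.util.regex.Pattern;

public class SuspectChars {
    // Lazy-prefix variant of the question's pattern: same matches,
    // but the engine does not greedily consume the whole input first.
    static final Pattern LAZY = Pattern.compile(".*?['\"\\x01-\\x1f].*");

    static boolean hasSuspectChar(String s) {
        return LAZY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSuspectChar("it's"));       // true: contains a quote
        System.out.println(hasSuspectChar("plain text")); // false: nothing suspect
    }
}
```

An alternative, per the comment under the question, is to drop both `.*` entirely and use `Pattern.compile("['\"\\x01-\\x1f]").matcher(s).find()`, which sidesteps the prefix question altogether.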

On a side note, this type of "potential ReDoS" pattern is reminiscent of one that was reported in AngularJS's angular.copy a couple of weeks back (and indeed in lodash's clone machinery for RegExps, and probably countless other libraries that use the same quick trick to extract flags from the end of a stringified RegExp).

Unrelated to the mechanics of this, I agree with rzwitserloot that this stuff tends to be a distraction from other more significant vulnerabilities. Maybe the real "denial of service" of each of these reports of a "regular expression denial of service" is the amount of time wasted every time while folks around the world scramble around trying to ascertain whether there's even an associated attack vector.

motto
  • This is rather bizarre; there is no difference whatsoever in the behaviour of a regex _match_ operation when using greedy vs. non-greedy operators; the only point to greediness modifiers is when you are finding subsets (e.g. in java speak using `matcher.find()` instead of `matcher.matches()`), or if matching groups (let's assume the regex contains no backrefs). Is the regex engine in java this crappy, that greediness ops change the perf. characteristics? – rzwitserloot Apr 13 '23 at 19:37
  • I'm a bit afraid of the latter; when backrefs/grouping isn't relevant, [Thompson's Construction](https://en.wikipedia.org/wiki/Thompson%27s_construction) can be used to guarantee O(n+M) behaviour. But the java regex engine doesn't consider it worthwhile to apply it in such cases (possibly correctly; tons of regex usage is for the group matching, and backrefs _are_ part of the spec). It may similarly decide to backtrack in such an utterly useless fashion too. – rzwitserloot Apr 13 '23 at 19:39
  • @rzwitserloot It is certainly bizarre, but the effect of using the non-greedy quantifier is there – although it appears sub-quadratic so I'll amend the answer. It's definitely more distinctive in the [JS engine](https://jsben.ch/dQYna) but just greedy vs non-greedy on an isomorphic Java test gives me about a 2x speed difference, if you're into anecdata. The AngularJS ReDoS from the other day though has a [very stark difference](https://jsben.ch/79ofN) between greedy/non-greedy prefix matching, and that is also borne out by Java. – motto Apr 13 '23 at 20:10
  • Well, that's... good to know. Thank you very much for the insights, @motto! – rzwitserloot Apr 13 '23 at 21:24