User defined regular expression security concerns

Question

Are there any security concerns if I run a user-defined regular expression on my server against a user-defined input string? I'm not asking about a single language but about any language, though PHP is one of the main languages I would like to know about.

For example, if I have the code below:

<?php

if(isset($_POST['regex'])) {
    preg_match($_POST['regex'], $_POST['match'], $matches);
    var_dump($matches);
}

?>
<form action="" method="post">
<input type="text" name="regex">
<textarea name="match"></textarea>
<input type="submit">
</form>

Assuming this is not a controlled environment (i.e. the user can't be trusted), what are the risks of the above code? If similar code is written in other languages, are there risks there too? If so, which languages are affected?

I have already read about 'evil regular expressions'; however, no matter what I try on my computer, they seem to work fine, as shown below.

PHP

<?php
php > preg_match('/^((ab)*)+$/', 'ababab', $matches);var_dump($matches);
array(3) {
  [0] =>
  string(6) "ababab"
  [1] =>
  string(0) ""
  [2] =>
  string(2) "ab"
}
php > preg_match('/^((ab)*)+$/', 'abababa', $matches);var_dump($matches);
array(0) {
}

JavaScript

phantomjs> /^((ab)*)+$/g.exec('ababab');
{
   "0": "ababab",
   "1": "ababab",
   "2": "ab",
   "index": 0,
   "input": "ababab"
}
phantomjs> /^((ab)*)+$/g.exec('abababa');
null

This leads me to believe that PHP and JavaScript have a fail-safe mechanism for evil regexes. Based on that, I would assume that other languages have similar features.

Is this a correct assumption?

Finally, for any or all of the languages that may be vulnerable, are there ways to make sure a regular expression doesn't cause damage?

Those regexes are evil when used on maliciously crafted, very long strings. — SLaks, Jan 05 '14 at 00:50
With the `e` modifier (in PHP) something will be evaluated (what you probably don't want), see [the manual](http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php) — kero, Jan 05 '14 at 00:56
@SLaks, when you say "long strings", any idea how long we're talking about? — GManz, Jan 05 '14 at 01:30
@kingkero, I know about that, but that's been deprecated and only works with `preg_replace()` — GManz, Jan 05 '14 at 01:31

nhahtdh · Accepted Answer · 2014-01-05T07:04:36.490

When you run a user-defined regex against a user-defined string on your server, the user can craft a regex that triggers catastrophic backtracking, usually paired with a failing input, to cause a denial of service on your system.

Using your example `^((ab)*)+$`, you need a slightly longer, failing input for catastrophic backtracking to take effect: "ababababababababababababababababababababababd".

For the PHP version, a call to `preg_last_error()` after the match should return `PREG_BACKTRACK_LIMIT_ERROR`.
For the JS version, the code above does not cause catastrophic backtracking in Firefox 26; the browser simply reports no match. On Chrome 31.0.1650.63 m and Internet Explorer 11, catastrophic backtracking can be observed.
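The PHP check described above can be sketched as follows. The pattern and subject are the ones from this answer; which branch runs is an assumption that depends on the PHP version, on whether PCRE's JIT is enabled, and on the configured `pcre.backtrack_limit`, so the sketch handles every outcome rather than expecting one.

```php
<?php
// Pattern and failing subject copied from the answer above.
$pattern = '/^((ab)*)+$/';
$subject = 'ababababababababababababababababababababababd';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$result = preg_match($pattern, $subject, $matches);

if ($result === false) {
    // The engine gave up before reaching a verdict; ask it why.
    $code = preg_last_error();
    if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
        echo "Backtrack limit exhausted\n";
    } else {
        echo "Regex failed with PCRE error code $code\n";
    }
} elseif ($result === 0) {
    echo "No match\n";
} else {
    echo "Matched\n";
}
```

Note that `var_dump($matches)` alone, as used in the question, cannot distinguish "no match" from "the engine aborted"; only the return value and `preg_last_error()` can.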

Depending on the language or library, the API may provide an option to limit the number of backtracking attempts or to set a time-out on the operation; it is strongly recommended that you set such a limit to prevent DoS on your server.

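PCRE defaults to stopping after 10 million backtracking attempts, and the number can be configured (PHP exposes it as the `pcre.backtrack_limit` ini setting, with its own default).
The .NET `Regex` class accepts a match time-out that limits the time taken for matching.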
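For PHP, one way to apply such a limit per call is to lower `pcre.backtrack_limit` around the match and restore it afterwards. The helper name `safe_preg_match` and the default budget of 100000 are assumptions made for illustration, not part of any library; with PCRE's JIT enabled the limit may be counted differently, so treat the number as a starting point.

```php
<?php
/**
 * Hypothetical helper: run an untrusted pattern with a reduced backtracking budget.
 * Returns the matches array on a match, null on no match,
 * and throws if PCRE reports any error (including an exhausted limit).
 */
function safe_preg_match(string $pattern, string $subject, int $backtrackLimit = 100000): ?array
{
    $previous = ini_get('pcre.backtrack_limit');
    ini_set('pcre.backtrack_limit', (string) $backtrackLimit);

    try {
        // @ suppresses the warning PHP emits for a syntactically invalid pattern.
        $result = @preg_match($pattern, $subject, $matches);
        $error = preg_last_error();
    } finally {
        // Restore the previous limit so other code is unaffected.
        ini_set('pcre.backtrack_limit', (string) $previous);
    }

    if ($result === false) {
        throw new RuntimeException("Regex rejected, PCRE error code $error");
    }

    return $result === 1 ? $matches : null;
}
```

A caller would wrap `safe_preg_match($_POST['regex'], $_POST['match'])` in a `try`/`catch` and report a generic error to the user when the exception is thrown.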

If the language doesn't come with such a convenient API, it is strongly recommended that you implement your own time-out mechanism to abort the execution.
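A hand-rolled time-out could look roughly like the sketch below, which runs the match in a child process and kills it when a deadline passes. It is written in PHP to stay consistent with the question, even though PCRE already has a limit, and it assumes a POSIX-like system, PHP 7.4 or later, and that `PHP_BINARY` points at the CLI executable. The function name `preg_match_with_timeout` is hypothetical.

```php
<?php
/**
 * Hypothetical helper: run preg_match() in a child PHP process and kill the
 * child if it has not finished within $timeoutSeconds.
 * Returns 1 or 0 like preg_match(), false on a regex error,
 * and null on time-out (or if the child produced no usable output).
 */
function preg_match_with_timeout(string $pattern, string $subject, float $timeoutSeconds)
{
    // Child script: read the serialized [pattern, subject] pair from stdin,
    // run the match, and print the JSON-encoded return value.
    $child = '$in = unserialize(file_get_contents("php://stdin"));'
           . 'echo json_encode(@preg_match($in[0], $in[1]));';

    $spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w']];
    $proc = proc_open([PHP_BINARY, '-r', $child], $spec, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException('Could not start child process');
    }

    fwrite($pipes[0], serialize([$pattern, $subject]));
    fclose($pipes[0]);
    stream_set_blocking($pipes[1], false);

    $deadline = microtime(true) + $timeoutSeconds;
    $output = '';
    while (true) {
        $output .= (string) stream_get_contents($pipes[1]);
        $status = proc_get_status($proc);
        if (!$status['running']) {
            // Drain anything written just before the child exited.
            $output .= (string) stream_get_contents($pipes[1]);
            break;
        }
        if (microtime(true) >= $deadline) {
            proc_terminate($proc, 9); // 9 = SIGKILL: the child cannot ignore it
            fclose($pipes[1]);
            proc_close($proc);
            return null;
        }
        usleep(10000); // poll every 10 ms
    }

    fclose($pipes[1]);
    proc_close($proc);

    return json_decode($output);
}
```

Spawning a process per match is expensive, so this design only makes sense when the engine offers no built-in limit; where one exists, as with PCRE, the built-in limit is the simpler choice.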

Unless the specification of the regex engine includes a requirement to prevent catastrophic backtracking (e.g. PCRE has a default backtracking limit), you shouldn't rely on the behavior of a specific implementation (such as Firefox, as described above).
