
From http://php.net/manual/en/function.preg-quote.php:

preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.

The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -

Note that / is not a special regular expression character.
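As a minimal sketch of what the manual describes (the string `$needle` and the subject text are made up for illustration), quoting a run-time string before embedding it in a pattern looks roughly like this. Passing the delimiter as the second argument matters because `/` is not escaped by default:

```php
<?php
// Hypothetical run-time string containing several regex metacharacters.
$needle  = '1+1=2 (really?)';
$subject = 'Is 1+1=2 (really?) true';

// Quote the string, also escaping the "/" delimiter if it occurs.
$quoted  = preg_quote($needle, '/');   // 1\+1\=2 \(really\?\)
$pattern = '/' . $quoted . '/';

// With quoting, the string is matched literally.
$withQuote = preg_match($pattern, $subject);

// Without quoting, "+", "?" and the parentheses keep their regex meaning,
// so the literal text is no longer found.
$withoutQuote = preg_match('/' . $needle . '/', $subject);

// The delimiter is only escaped when it is passed explicitly.
$noDelimiter   = preg_quote('a/b');       // a/b
$withDelimiter = preg_quote('a/b', '/');  // a\/b

var_dump($quoted, $withQuote, $withoutQuote, $noDelimiter, $withDelimiter);
```

Note that `=` is escaped here even though it is harmless in this position, which is the behaviour the question is about.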

} is unnecessary but I can understand why they'd include it for symmetry. E.g. the following code works:

$re = '/}{This is fine}{/';
preg_match($re, $re, $match);
var_dump($match);

The output is:

array(1) {
  [0] =>
  string(16) "}{This is fine}{"
}

Why do they include = ! < > :? As far as I can tell, they're only ever special after being introduced by another unescaped meta character, e.g. immediately after (?, both of which characters also get escaped. : can also be special inside character classes like so: [[:alpha:]], but all four brackets get escaped.

CJ Dennis
  • Because it's far easier to escape all of them all of the time instead of trying to figure out the context in which they might occur. If you don't want them escaped, then don't run them through `preg_quote()`. – Sammitch Feb 08 '18 at 01:39
  • @Sammitch In what context would not escaping them cause the regex to fail? – CJ Dennis Feb 08 '18 at 01:40
  • Just because a character isn't special in your sample code doesn't mean it is never special. FYI, from v7.3 on, `preg_quote()` escapes `#` as well. – revo Feb 17 '18 at 23:33
  • @revo It would be a lot simpler if they just quoted all [\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F] characters (I can't remember the code for underscore [_], take it out). Then they wouldn't have to worry about changing the code in the future. Note that space (\x20), tab (\x09), CR (\x0D) and LF (\x0A) are special characters if you use the /x modifier. – CJ Dennis Feb 18 '18 at 22:33
  • Escaping lots of characters is not simpler; it is what you do when you run out of ideas. The current C code responsible for this escaping is a `switch` statement with 21 `case`s. `#` was added to the list because of the `PCRE_EXTENDED` (`x`) modifier. In this mode whitespace is almost entirely ignored. – revo Feb 18 '18 at 22:55
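
The point raised in the comments about the `x` modifier can be sketched as follows. This is only an illustration with a made-up literal (`$literal`), and it relies on `preg_quote()` leaving the space character unescaped:

```php
<?php
// Hypothetical literal containing a space.
$literal = 'a b';

// preg_quote() does not touch the space, so the quoted text is unchanged.
$quoted = preg_quote($literal, '/');

// Under the x (extended) modifier, unescaped whitespace in the pattern is
// ignored, so the pattern effectively becomes /ab/.
$pattern = '/' . $quoted . '/x';

$matchesOriginal = preg_match($pattern, 'a b'); // the literal itself
$matchesSquashed = preg_match($pattern, 'ab');  // the literal without the space

var_dump($quoted, $matchesOriginal, $matchesSquashed);
```

So a quoted string is not automatically safe to drop into a pattern that uses `x`; whether `#` is escaped as well depends on the PHP version, as noted above.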

2 Answers


I think the idea behind it is to have consistent behaviour.

The goal of preg_quote is to produce a literal string for a regex pattern. This means that no character in the returned string can be interpreted as anything other than itself, whatever the context, and that context can be a concatenation with another part of the pattern.

If I write '/(?' . preg_quote('>') . 'abc)/', I expect that the > will not be interpreted as the > of an atomic group, and that the pattern returns an error.

If I write '/.{3' . preg_quote('}') . '/', I expect that the } will not be interpreted as the closing curly bracket of a quantifier, and that the pattern matches a string like 'a{3}', but not 'abc'.

You can easily build the same kind of examples for = ! < > : using lookahead assertions, named groups, non-capturing groups, or atomic groups.

The important point is that the expected behaviour is always the same, whatever the way or the context in which the function is used.
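
The two concatenation examples above can be checked with a short sketch. The variable names are only for illustration, and `@` is used to silence the compile warning that the first pattern is expected to raise:

```php
<?php
// Example 1: the quoted ">" must not open an atomic group.
$atomicQuoted = '/(?' . preg_quote('>') . 'abc)/';   // '/(?\>abc)/'
$atomicRaw    = '/(?' . '>' . 'abc)/';               // '/(?>abc)/'

// "(?\" is not valid syntax, so compilation fails and preg_match()
// returns false (the warning is suppressed with @).
$quotedResult = @preg_match($atomicQuoted, 'abc');
// Unquoted, the same pieces form a valid atomic group.
$rawResult = preg_match($atomicRaw, 'abc');

// Example 2: the quoted "}" must not close a quantifier.
$quantQuoted = '/.{3' . preg_quote('}') . '/';       // '/.{3\}/'
$quantRaw    = '/.{3' . '}' . '/';                   // '/.{3}/'

$literalHit  = preg_match($quantQuoted, 'a{3}'); // any char followed by "{3}"
$literalMiss = preg_match($quantQuoted, 'abc');  // no literal "{3}" here
$quantified  = preg_match($quantRaw, 'abc');     // three arbitrary characters

var_dump($quotedResult, $rawResult, $literalHit, $literalMiss, $quantified);
```

In both cases the quoted character stays a literal regardless of what the surrounding pattern fragments would otherwise turn it into.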

Casimir et Hippolyte
  • Your example is bad. If `{` isn't supposed to be meta, it should be escaped in the pattern. – CJ Dennis Feb 08 '18 at 04:47
  • @CJDennis: No. `{` isn't supposed to be anything in my example, meta or not. The problem is that you always think about working patterns without considering wrong patterns and edge cases. Moreover, the opening curly bracket is a special case in PCRE since it becomes a metacharacter only in context: `/.{3\}/` matches `'a{3}'` and `/.{abc}/` matches `'z{abc}'` without error. I will choose another character from the list for the example. – Casimir et Hippolyte Feb 08 '18 at 05:44
  • @cale_b: you are right, it's incomprehensible, I will change that. – Casimir et Hippolyte Feb 08 '18 at 05:49
  • The author of the code either intended it to be a meta or not (possibly they forgot to escape it and make it literal). There is no third option of "neither". The pattern `/.{3\}/` is valid and will match the literal string of any single character plus `{3}`, e.g. `a{3}`. – CJ Dennis Feb 08 '18 at 05:51
  • @CJDennis: a function doesn't care about the author's intent; its job is to always return the same answer whatever the situation, and the goal of `preg_quote` is to neutralize all characters that are likely to be seen as part of a special sequence. In my examples too, there is no author and no intent; I only illustrate the expected behaviour of this function. – Casimir et Hippolyte Feb 08 '18 at 06:26
  • Change `~` to `/`. I somewhat agree with you, but the only use-case is to make bad patterns always fail. `(?:` is atomic. It's bad form to try to construct it out of `(?` + `:`, as is supplying an opening parenthesis and expecting the user to supply the closing one correctly. If your pattern is well-written you don't need to escape any of the extra characters. – CJ Dennis Feb 08 '18 at 07:02
  • @CJDennis: once more, the goal of a function is not to trust a coder or a user, or to try to understand what they want. – Casimir et Hippolyte Feb 08 '18 at 07:08
  • If the aim is consistency, why isn't `'` escaped? – CJ Dennis Feb 08 '18 at 08:02
  • @CJDennis: The reason is probably historical. Syntaxes that use `'` came with PCRE 7.0 (Dec 2006). If you consider that `preg_quote` appeared with PHP 4 and that the hyphen was only added in PHP 5.3, that gives you a good idea of the pace of development. Same thing for `&`. – Casimir et Hippolyte Feb 08 '18 at 20:17

Well what happens if you're trying to write some code like this:

$lookahead = getUserInput();  // Not escaped
$results = preg_match('/abc(?' . $lookahead . ')/', $subject);

and the user gives the input !def? The answer is you get negative lookahead instead of regular lookahead. If you don't want to allow negative lookaheads, you're going to want to make sure that exclamation mark is escaped.
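
A runnable sketch of that scenario, with the user input replaced by hard-coded strings (`$inputs` and the subject are assumptions made for the demonstration), shows how the unquoted input decides which kind of assertion the pattern contains:

```php
<?php
$subject = 'abcdef';

// Stand-ins for getUserInput(); both values are hypothetical.
$inputs = ['=def', '!def'];

$rawResults = [];
foreach ($inputs as $input) {
    // Unquoted: the first character of the input selects the assertion type.
    $rawResults[$input] = preg_match('/abc(?' . $input . ')/', $subject);
}

// Quoted: "(?\!" is not valid syntax, so the pattern fails to compile and
// preg_match() returns false (warning suppressed with @).
$quotedResult = @preg_match('/abc(?' . preg_quote('!def', '/') . ')/', $subject);

// One possible alternative: keep the assertion syntax in the template and
// quote only the literal part supplied by the user.
$fixedPattern = '/abc(?=' . preg_quote('!def', '/') . ')/';
$fixedLiteral = preg_match($fixedPattern, 'abc!def');
$fixedOther   = preg_match($fixedPattern, 'abcdef');

var_dump($rawResults, $quotedResult, $fixedLiteral, $fixedOther);
```

With `=def` the raw pattern is a positive lookahead and matches `abcdef`; with `!def` it silently becomes a negative lookahead and does not.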

Max
  • You didn't `preg_quote()` the user input, and even if you did, `(!=def)` is not special except for the capturing parentheses. It would match any instance of `abc!=def`, setting capture[0] to `abc!=def` and capture[1] to `!=def`. If you add a `?` to the start of the user input, it will get quoted and only match a literal question mark. – CJ Dennis Feb 08 '18 at 03:00
  • @CJDennis Yes, the point is to demonstrate what happens when you don't `preg_quote()`. This is why `preg_quote()` escapes all valid regex symbols and not just the ones that are valid in an outermost context. – Max Feb 08 '18 at 03:06
  • Please read my question again. I didn't ask what the purpose of `preg_quote()` is, I asked why some of the characters are quoted unnecessarily. E.g. someone wants to quote the `@` sign because they deal with email addresses. Since `@` is never special, it doesn't need to be quoted, but `.` does. – CJ Dennis Feb 08 '18 at 04:44
  • @CJDennis I used that to demonstrate an example of why one might need to escape these unusual characters. Exclamation mark is used in negative lookahead and lookbehind. Angle brackets are used in named captures. Colon is used in named character classes. And at symbol (@) is NOT escaped. (See for yourself in the PCRE source: https://raw.githubusercontent.com/php/php-src/master/ext/pcre/php_pcre.c ) but it IS escaped in perl (from whence it draws its name) because it is the array sigil for perl. – Max Feb 08 '18 at 07:52