
I am using a UTF-8 regex to extract the parts of the Content-Type: header line, since I habitually configure my servers to use UTF-8 consistently.

// Example type; in practice this is negotiated from the request's `Accept:` header line.
$content_type = 'TeXt/HtMl';
preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~ui', $content_type, $matches);

I am considering loading classes from a filesystem path built from the subpattern matches.

Is there any conceivable way to inject a '/../' through an encoding attack? How does internal encoding work in general? Do I have to care which charset the request is encoded in when processing data in PHP code, or does the conversion work automatically and reliably? What else should be kept in mind regarding encoding security? How can one ensure the encoding in deployed code running on unknown systems?

EDIT: As asked in comments, some further code could look like e.g.:

$m1 = strtolower($matches[1]);
$m2 = strtolower($matches[2]);
include_once "/path/to/project/content_handlers/{$m1}_{$m2}";

Remarks: My question was meant to be more general. Consider this scenario: The PHP script is encoded in UTF-8. The server's filesystem is encoded in character set A. The client manipulates the request so that it is sent in encoding B. Is there a risk that the Accept header is written in such a way that the preg_* functions fail to recognize a '/../' (parent directory) sequence which the filesystem does recognize? The question is not limited to the particular regex in the example. Could an attacker include arbitrary files present in the filesystem if no further precautions are taken?

Remarks 2: In the provided example I cannot rely on http_negotiate_content_type, since I cannot be sure that pecl_http is installed on the target server. There is a scripted polyfill as well. Again: this is not a question about one particular case. I want to learn how to treat client encodings (even manipulated ones) in general.

Remarks 3: A similar problem (with SQL encoding attacks) is discussed here: Are PDO prepared statements sufficient to prevent SQL injection? However, my question is about filesystem encoding. Could something similar happen?

Pinke Helga
  • It all depends on its following lines of code. Please update your post including them. – revo Sep 15 '18 at 03:52
  • As mentioned `$matches[0]` and `$matches[1]` are used to build a filesystem path. The basic question is if one can generally rely on `preg_match('~whatever~u')`. I do not want unnecessary overhead if this is already safe. – Pinke Helga Sep 15 '18 at 04:54
  • What you are going to do with `$matches` is important. You have to show us. In general, no, it's not safe. – revo Sep 15 '18 at 04:56
  • @mickmackusa Yes, * is the optional wildcard, and the case-insensitive match is required since, according to the specs, 'text/html', 'Text/Html' and 'TExt/htML' are all equivalent, even if the first one is recommended and almost every user agent complies with that. – Pinke Helga Sep 15 '18 at 05:51
  • Good point - that's correct. – Pinke Helga Sep 15 '18 at 05:52

1 Answer

I'll be bold and say that your code will effectively prevent malicious substrings. If someone tries to sneak in a malicious sequence of characters, preg_match() will reject it. Your use of anchors and character classes leaves no wiggle room. The pattern is nice and strict.

Just a couple of notes:

  1. \w already matches both uppercase and lowercase letters, so the i pattern modifier is not necessary.
  2. Your capture groups are stored in $matches[1] and $matches[2]. The full-string match is in $matches[0].

Code:

$content_type = 'TeXt/HtMl';
if (!preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~u', $content_type, $matches)) {
    echo "invalid content type";
} else {
    var_export($matches);
}

Output:

array (
  0 => 'TeXt/HtMl',
  1 => 'TeXt',
  2 => 'HtMl',
)
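To check the claim above against a few hostile inputs, here is a minimal sketch. The helper name parse_content_type() and the sample payloads are assumptions of mine for illustration, not part of the original code. It relies on the fact that, with the u modifier, preg_match() returns false when the subject is not valid UTF-8, so malformed byte sequences never produce any matches.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the two
// lowercased parts of a content type, or null if the input is rejected.
function parse_content_type(string $content_type): ?array
{
    // preg_match() returns 0 on no match and false on error (for example
    // invalid UTF-8 with the u modifier); both count as a rejection here.
    if (preg_match('~^([\w-]+\*?)/([\w-]+\*?)$~u', $content_type, $matches) !== 1) {
        return null;
    }
    return [strtolower($matches[1]), strtolower($matches[2])];
}

// Illustrative payloads (assumptions, not an exhaustive attack list).
$samples = [
    'TeXt/HtMl',           // valid
    '../../etc/passwd',    // dots are not in the character class
    'text/../secret',      // a dot right after the slash fails the class
    "text/html\0.php",     // a NUL byte is not a word character
    "\xC0\xAE\xC0\xAE/x",  // overlong encoding of "..": invalid UTF-8
];

foreach ($samples as $sample) {
    $parts = parse_content_type($sample);
    // Print the raw bytes as hex so control characters stay visible.
    echo bin2hex($sample), ' => ', $parts === null ? 'rejected' : implode('_', $parts), "\n";
}
```

Two caveats about this sketch: without the D modifier, $ also matches just before a trailing newline, so "text/html\n" is accepted (the captured groups still contain no newline). And with the u modifier, \w should also match non-ASCII letters and digits, so the captured parts are not guaranteed to be plain ASCII.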
mickmackusa
  • Yes, of course in my code I used indexes 1 and 2 of the subpatterns. It was a mistake made while quickly writing an example by hand, pressured by comments. I've just corrected this by editing the question to prevent misleading answers. My question is intended to be more general: when the user agent sends a request in, e.g., a Chinese encoding, does PHP internally work on UTF-8 (or another pre-configured encoding), and will any preg_* function with the `u` modifier be safe? – Pinke Helga Sep 15 '18 at 13:03
  • With the unicode pattern modifier, you are going to be all good with multibyte characters. Do you mean like this: https://3v4l.org/ml7Za ? – mickmackusa Sep 15 '18 at 13:08
  • Yes, there are some well-known SQL injection scenarios where sanitizing strings included in a SQL command does not work properly, because the database recognizes a quote that the sanitizing function did not, due to different multibyte encodings. – Pinke Helga Sep 15 '18 at 13:16
  • We weren't originally talking about sql at all, right? If you are going to be using user-supplied data in a query, prepared statements are the right way to go. – mickmackusa Sep 15 '18 at 13:24
  • Right, this question is not about SQL; it is about similar potential risks with the filesystem encoding. SQL encoding attacks are just a well-known scenario for comparison. – Pinke Helga Sep 15 '18 at 13:36
  • By the way, even when using prepared statements, you might experience a bad surprise. With PDO, for example, you cannot completely prevent emulation mode (or it is very hard to), depending on the driver installed on the target system. – Pinke Helga Sep 15 '18 at 13:40
  • _you might experience some bad surprise_ <- that's a very vague argument. I'm happy to read, if you want to drop any legitimate reference material. – mickmackusa Sep 15 '18 at 13:42
  • https://stackoverflow.com/questions/134099/are-pdo-prepared-statements-sufficient-to-prevent-sql-injection - in any case +1 for your effort and discovering careless mistakes on question editing. :) – Pinke Helga Sep 15 '18 at 13:44
  • I suppose if you are extremely concerned with security, then you will need to sacrifice flexibility and write a whitelist of acceptable content types and check if the strtolower() value has an identical match in that set. – mickmackusa Sep 15 '18 at 14:06
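The whitelist idea from the last comment could be sketched roughly as follows. The constant CONTENT_HANDLERS, the function handler_path_for() and the handler file names are hypothetical and chosen only for illustration. The point of the design is that the path is assembled solely from values defined in the script, so no bytes from the request reach the filesystem regardless of the client's encoding.

```php
<?php
// Hypothetical whitelist: the content types and handler file names below
// are assumptions for illustration only.
const CONTENT_HANDLERS = [
    'text/html'        => 'text_html.php',
    'application/json' => 'application_json.php',
];

// Returns the handler path for an allowed content type, or null otherwise.
function handler_path_for(string $content_type, string $base_dir): ?string
{
    $key = strtolower($content_type);
    // Only an exact match against a key defined in this file is accepted;
    // the request value itself is never used to build the path.
    if (!array_key_exists($key, CONTENT_HANDLERS)) {
        return null;
    }
    return rtrim($base_dir, '/') . '/' . CONTENT_HANDLERS[$key];
}

$path = handler_path_for('TeXt/HtMl', '/path/to/project/content_handlers');
echo $path ?? 'unsupported content type', "\n";
```

The trade-off is the one named in the comment: every new content type has to be added to the list by hand.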