30

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do?

$string = str_replace('"', '\\"', $string);

In particular, if the input was in a character set that might have a valid character like 0xbf5c, so an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unquoted double quote (").

Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place?

(In my case, the string is going into the value attribute of an HTML input tag: echo 'input type="text" value="' . $string . '">';)

EDIT: For that matter, what about a function like preg_quote()? There's no charset argument for it, so it seems totally useless in this scenario. When you DON'T have the option of limiting charset to UTF-8 (yes, that'd be nice), it seems like you are really handicapped. What replace and quoting functions are available in that case?

user456885
  • 443
  • 1
  • 5
  • 8
  • 1
    See [Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?](http://stackoverflow.com/questions/2652193/can-str-replace-be-safely-used-on-a-utf-8-encoded-string-if-its-only-given-valid) to read why you don't need a mb_str_replace. – Lode Aug 09 '11 at 07:30

3 Answers3

33

No, you’re right: Using a singlebyte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:

$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));

Edit    Here’s a mb_replace implementation using the split-join variant:

function mb_replace($search, $replace, $subject, &$count=0) {
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            $string = &mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
    } elseif (is_array($search)) {
        if (!is_array($replace)) {
            foreach ($search as &$string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            $n = max(count($search), count($replace));
            while ($n--) {
                $subject = mb_replace(current($search), current($replace), $subject, $c);
                $count += $c;
                next($search);
                next($replace);
            }
        }
    } else {
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts)-1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}

As regards the combination of parameters, this function should behave like the singlebyte str_replace.

Gumbo
  • 643,351
  • 109
  • 780
  • 844
  • 3
    There is no mb_str_replace in PHP – reko_t Sep 24 '10 at 10:07
  • 1
    Your syntax for mb_ereg_replace() is wrong - it requires a regular expression. I was hoping to avoid heavier processing required by this kind of function, but I guess I'm out of luck. I need to use this for EVERYTHING - even something like preg_quote(), too, right? Despite the normal ereg_* functions being deprecated as of PHP5.3, does the same not apply to mb_ereg_* functions? – user456885 Sep 24 '10 at 10:24
  • 1
    @user456885: `"` is a valid regular expression that describes the character `"` (note that this is not PCRE). But why don’t you use the second split-join variant? – Gumbo Sep 24 '10 at 10:29
  • Sorry, my bad on the PCRE. The split/join trick is nifty indeed. I guess I'm just holding out hope that I don't have to write custom replacements for such trivial things as a simple str_replace() or even standard, built-in PHP functions like preg_quote(). Bummer. – user456885 Sep 24 '10 at 10:32
  • Also, I know the ereg_* functions are now deprecated - does that (or *will* that) include the mb_ereg_* functions too? – user456885 Sep 24 '10 at 10:33
  • Nice work. As reko_t said, though, don't you still have to specify the charset with mb_regex_encoding() BEFORE you use either of these suggested solutions? – user456885 Sep 24 '10 at 21:00
  • Also, watch out for that preg_quote() in your code (which is ALSO not multibyte safe). Presumably, $search is not external input, so in most cases it's OK, but this one could come back to bite you in rare situations. – user456885 Sep 24 '10 at 21:07
  • @user456885: The [configured internal character encoding](http://php.net/manual/en/mbstring.configuration.php#ini.mbstring.internal-encoding) is used if not other specified. – Gumbo Sep 24 '10 at 21:08
  • 1
    This isn't necessary for well-formed UTF-8 text. http://www.phpwact.org/php/i18n/utf-8#str_replace – philfreo Jul 10 '11 at 23:14
  • 5
    Also as said in [Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?](http://stackoverflow.com/questions/2652193/can-str-replace-be-safely-used-on-a-utf-8-encoded-string-if-its-only-given-valid), there is no need for a mb_str_replace as long as the input is valid UTF-8. See [phpwact.org](http://www.phpwact.org/php/i18n/charsets#checking_utf-8_for_well_formedness) and a comment by [hfuecks at the php manual](http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805) for a way to check for (in)valid UTF-8. – Lode Aug 09 '11 at 07:27
  • For more information about speed of `str_replace`, `preg_replace`, `mb_ereg_replace`. I tested by replace a string that contain multi languages for 100000 rounds with these 3 functions and here is the results. `str_replace` 0.15748500823975 seconds. `preg_replace` 0.2059760093689 seconds. `mb_ereg_replace` 0.85110902786255 seconds. – vee May 02 '19 at 15:11
  • I suppose this answer should be rewritten or deleted. – Your Common Sense Feb 26 '23 at 08:10
9

The code is perfectly safe with sane multibyte-encodings like UTF-8 and EUC-TW, but dangerous with broken ones like Shift_JIS, GB*, etc. Rather than going through all the headache and overhead to be safe with these legacy encodings, I would recommend just supporting only UTF-8.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • 1
    not always possible, especially for the very popular topic of data mining. – Timo Huovinen Feb 10 '12 at 15:06
  • 3
    @TimoHuovinen: For such applications where you have to deal with data in non-UTF-8 encodings, the simplest solution is reencoding during the input phase so that it's in UTF-8 by the time you process it. – R.. GitHub STOP HELPING ICE Jun 17 '14 at 16:21
  • 1
    easier said than done :) for example this way I discovered a list of unsupported yet [browser supported characters](http://stackoverflow.com/questions/3565713/how-can-i-convert-html-character-references-x5e3-to-regular-utf-8/3566055#3566055). The issues with using DOMDocument with UTF-8. There are many issues like this that will make this a living nightmare for the newcomer. [Like this](http://stackoverflow.com/questions/9210473/how-to-convert-text-with-html-entites-and-invalid-characters-to-its-utf-8-equi) – Timo Huovinen Jun 17 '14 at 20:01
3

You could use either mb_ereg_replace by first specifying the charset with mb_regex_encoding(). Alternatively if you use UTF-8, you can use preg_replace with the u modifier.

reko_t
  • 55,302
  • 10
  • 87
  • 77
  • I can't limit to UTF-8 unfortunately, which I think would solve the problem. I guess mb_ereg_replace() is the only solution out there (?) ... but seems inefficient for a simple str_replace(). I'd have to call it as a replacement for preg_quote() too, eh? ... Also, I know the ereg_* functions are now deprecated - does that include the mb_ereg_* functions too? – user456885 Sep 24 '10 at 10:16
  • Code snippets are better than sentences IMPO – Peyman Mohamadpour May 15 '17 at 06:25