24

PHP's str_replace() was designed for single-byte (ANSI) strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it were only given valid UTF-8 strings as arguments?

Edit: I'm not looking for a replacement function; I would just like to know whether this hypothesis is correct.

Kevin
Manos Dilaverakis

5 Answers

22

Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.

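In UTF-8, the byte sequence for any non-ASCII character always begins with a lead byte in the range \xC0-\xFF (in valid UTF-8, only \xC2-\xF4 occur) and continues with bytes in the range \x80-\xBF. A lead byte may not appear anywhere else in the sequence, and ASCII bytes (\x00-\x7F) never appear inside one, so you can't make a valid UTF-8 sequence that matches part of a character.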
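To make that concrete, here is a minimal sketch (not from the original answer). It assumes PHP 7+ for the "\u{...}" escapes, and `is_valid_utf8()` is a hypothetical helper name built on the fact that PCRE's /u modifier rejects invalid UTF-8 subjects. "é" and "è" share the lead byte C3, yet a search for "é" only matches the complete two-byte sequence:

```php
<?php
// Hypothetical helper: preg_match() with /u returns false for invalid UTF-8.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

$subject = "caf\u{E9} cr\u{E8}me";   // "café crème"

// Show the raw bytes: both characters start with the lead byte C3.
echo bin2hex("\u{E9}"), " ", bin2hex("\u{E8}"), "\n";   // c3a9 c3a8

// Byte-wise replacement of "é" leaves "è" untouched.
$result = str_replace("\u{E9}", "e", $subject);

echo $result, "\n";                 // cafe crème
var_dump(is_valid_utf8($result));   // bool(true)
```

The result is still valid UTF-8 because the needle can only line up with whole characters.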

This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example when trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
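The Shift-JIS problem can be reproduced with raw bytes. This is an illustrative sketch, not part of the original answer; it uses the katakana "ソ", which is encoded as 83 5C in Shift-JIS and as E3 82 BD in UTF-8:

```php
<?php
// "ソ" in Shift-JIS: the trail byte 0x5C is the same byte as ASCII backslash.
$sjis = "\x83\x5C";

// A byte-wise backslash replacement corrupts the character.
$broken = str_replace("\\", "/", $sjis);
echo bin2hex($broken), "\n";        // 832f

// The same character in UTF-8 contains no byte below 0x80,
// so the backslash search finds nothing to replace.
$utf8 = "\xE3\x82\xBD";
$untouched = str_replace("\\", "/", $utf8);
echo bin2hex($untouched), "\n";     // e382bd
```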

bobince
4

It's correct because UTF-8 multibyte characters consist exclusively of non-ASCII bytes (value 128+), beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.

To visualise (abstractly):

  • a for an ASCII character
  • 2x for a 2-byte character
  • 3xx for a 3-byte character
  • 4xxx for a 4-byte character

If you're matching, say, a2x3xx (with the a bytes in the ASCII range), then since a < x, and 2x cannot be a substring of 3xx or 4xxx, et cetera, you can be sure that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
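The notation above can be generated from real strings. A rough sketch follows, where `utf8_shape()` is a hypothetical helper name and the "\u{...}" escapes assume PHP 7+. It classifies each byte by its value range and assumes the input is already valid UTF-8:

```php
<?php
// Hypothetical helper: map each byte of a (valid) UTF-8 string to
// 'a' (ASCII), '2'/'3'/'4' (lead byte of that length), or 'x' (continuation).
function utf8_shape(string $s): string {
    $shape = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $shape .= 'a';   // 00-7F: ASCII
        } elseif ($b < 0xC0) {
            $shape .= 'x';   // 80-BF: continuation byte
        } elseif ($b < 0xE0) {
            $shape .= '2';   // C0-DF: lead byte of a 2-byte character
        } elseif ($b < 0xF0) {
            $shape .= '3';   // E0-EF: lead byte of a 3-byte character
        } else {
            $shape .= '4';   // F0 and up: lead byte of a 4-byte character
        }
    }
    return $shape;
}

// "a", "ä" (U+00E4), "€" (U+20AC), and U+1F600
echo utf8_shape("a\u{E4}\u{20AC}\u{1F600}"), "\n";   // a2x3xx4xxx
```

Because the four byte classes occupy disjoint value ranges, a needle of shape 2x can only match where the haystack also has shape 2x.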

Edit: See bobince's answer for a less abstract explanation.

Community
pinkgothic
0

Yes, I think this is correct, at least I couldn't find any counter-example.

Kevin
user187291
0

Well, I do have a counter-example: I have a UTF-8 encoded settings ".ini" file specifying application settings like the email sender name. It says something like:

email_from = Märta

and I read it from there into the variable $sender_name. Then I substitute it into the message body (UTF-8 again):

regards {sender}

$message = str_replace("{sender}",$sender_name,$message);

The email is absolutely correct in every respect, but the sender is totally broken. There are other cases (like explode()) where something goes wrong with a UTF-8 string: it is intact before the call but not after it. Sorry to say, there seems to be no way of correcting this behaviour.

Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that function and str_replace() may be innocent.
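One way to narrow down a case like this is to check whether each input really is valid UTF-8 before blaming str_replace(). The sketch below is an assumption about what might have happened, not a diagnosis of the setup described above: it compares a UTF-8 sender name with a hypothetical ISO-8859-1 one, using a hypothetical `is_valid_utf8()` helper based on PCRE's /u modifier.

```php
<?php
// Hypothetical helper: preg_match() with /u returns false for invalid UTF-8.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

$message = "regards {sender}";

$utf8Sender   = "M\xC3\xA4rta";   // "Märta" encoded as UTF-8
$latin1Sender = "M\xE4rta";       // "Märta" encoded as ISO-8859-1 (assumed bad input)

$ok  = str_replace("{sender}", $utf8Sender, $message);
$bad = str_replace("{sender}", $latin1Sender, $message);

var_dump(is_valid_utf8($ok));    // bool(true)
var_dump(is_valid_utf8($bad));   // bool(false)
```

In the second case str_replace() has only copied the bytes it was given; the output is broken because one input was not UTF-8 to begin with.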

karvonen
  • And you’re absolutely positive that your e-mail body is properly declared to be UTF-8 encoded? – Gumbo May 05 '10 at 22:51
  • Yes, absolutely positive. I have had weird experiences with explode() before. A simplified example: `function ech1($var){ echo $var; }` `function ech2($var){ $parts = explode("|", $var); echo $parts[1]; }` Calling `ech1($var);` causes no problems. Now get a concatenated result string from the DB ("Bjorn|Weckström") and pass it to the second function: `ech2($concatenated);` The version that uses explode() breaks the UTF-8 every time. And this happens on the web as well, not only in mail. EDIT: sorry for loss of formatting – karvonen May 15 '10 at 11:59
  • I can second this - `str_replace` completely murders UTF8 strings, and I am not aware of any way of working around it. – csvan Oct 27 '15 at 15:32
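For reference, the explode() scenario from the comments can be tried in isolation. This is a minimal sketch assuming the input string really is UTF-8 (written here with a PHP 7+ "\u{...}" escape); the delimiter "|" is the ASCII byte 7C, which never occurs inside a multibyte UTF-8 sequence:

```php
<?php
// "Bjorn|Weckström" with ö (U+00F6) encoded as the UTF-8 bytes C3 B6.
$concatenated = "Bjorn|Weckstr\u{F6}m";

$parts = explode("|", $concatenated);

echo $parts[1], "\n";            // Weckström
echo bin2hex($parts[1]), "\n";   // 5765636b737472c3b66d
```

If the output looks garbled in a setup like this, the encoding of the data source or the declared charset of the page or mail would be the first things to check.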
0

No, you cannot.
Speaking from practice: if you have some multibyte symbols like ◊ mixed with single-byte ones, it won't work correctly, because some symbols take 2-4 bytes, while str_replace works on a fixed number of bytes and replaces those. The result is garbage that isn't any valid symbol.
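This claim is easy to test. The sketch below assumes PHP 7+ for the "\u{...}" escapes, and `is_valid_utf8()` is again a hypothetical helper name. It replaces ◊ (U+25CA, bytes E2 97 8A) first with a complete needle and then with a needle that is only the first byte of the character:

```php
<?php
// Hypothetical helper: preg_match() with /u returns false for invalid UTF-8.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

$text = "a \u{25CA} b";   // "a ◊ b"

// Needle is the complete 3-byte character: the result stays valid.
$good = str_replace("\u{25CA}", "*", $text);
echo $good, "\n";                 // a * b
var_dump(is_valid_utf8($good));   // bool(true)

// Needle is a lone lead byte, which is not valid UTF-8 by itself:
// two orphaned continuation bytes are left behind.
$bad = str_replace("\xE2", "?", $text);
var_dump(is_valid_utf8($bad));    // bool(false)
```

Garbage of the kind described appears in the second case, where one of the arguments is itself not valid UTF-8, which is the situation the question explicitly excludes.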

Kevin
local