Is replacing a line break UTF-8 safe?

Question

If I have a UTF-8 string and want to replace line breaks with the HTML <br>, is this safe?

$var = str_replace("\r\n", "<br>", $var);

I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.

jalf · Accepted Answer · 2010-10-11T11:17:17.193

18

UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it is an ASCII character.

And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
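
A minimal sketch of that property, assuming the source file itself is saved as UTF-8. The sample string and the helper name isValidUtf8 are made up for illustration. It checks that every byte of a few multi-byte characters is 128 or higher, so the ASCII bytes for "\r" and "\n" can only ever match real line breaks:

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier fails on invalid UTF-8,
// so a result of 1 for an empty pattern means the subject is well-formed.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Sample text (an assumption for this sketch) mixing ASCII, accented Latin and CJK.
$var = "héllo\r\n世界\r\nend";

// Collect every byte that belongs to a non-ASCII character.
$highBytes = [];
foreach (str_split("é世界") as $byte) {
    $highBytes[] = ord($byte);
}

// None of those bytes falls in the ASCII range 0-127.
$allHigh = min($highBytes) >= 128;

// Byte-wise replacement of an ASCII sequence with another ASCII sequence.
$result = str_replace("\r\n", "<br>", $var);

var_dump($allHigh);              // bool(true)
var_dump(isValidUtf8($result));  // bool(true)
echo $result, "\n";              // héllo<br>世界<br>end
```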

edited Oct 11 '10 at 11:17

answered Oct 10 '10 at 23:17


* Only try to replace ASCII characters with ASCII characters. – Potatoswatter Oct 10 '10 at 23:21
+1 for the proper explanation. I think others tried to say the same thing, but this is the only answer that made perfect sense to me. – Sasha Chedygov Oct 11 '10 at 00:28
@Potatoswatter: oops yeah, that's a pretty important point. Edited. ;) – jalf Oct 11 '10 at 11:17
Actually, it's even safer than that. UTF-8 is designed so that you can do byte-only-aware text replacements on UTF-8 in a safe and valid way, as long as all the inputs and outputs are valid UTF-8 strings, and you will never get an error. Not only can he use str_replace for ASCII replacements in a UTF-8 string, he can also use str_replace for UTF-8 replacements in a UTF-8 string. – Jul 17 '11 at 14:04
UTF8 is indeed great! They even thought about byte-only-aware processing! Thanks for the insight. – Omar Al-Ithawi Jan 11 '12 at 11:19
Actually the sequence-by-sequence replace is safe as long as the sequences are valid UTF-8 sequences: even if you replace "שלום" by "和平" as _byte sequences_ it is safe by UTF-8 design – Artyom May 02 '12 at 09:04
@Artyom: well, yes, but that's true for every sane encoding. Given three strings S, S0 and S1, where S0 is a substring of S, as long as all three are valid in encoding E, replacing S0 with S1 in S will *also* yield a string that's valid in encoding E. I don't see what's so special about that. The interesting thing about UTF8 is the guarantees it offers for *other* encodings (ASCII, in this case) – jalf May 02 '12 at 09:16
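
The point raised in the comments above, replacing one non-ASCII sequence with another, can be sketched as follows. This assumes the source file is saved as UTF-8, and the sample sentence is invented. It works at the byte level because UTF-8 lead bytes and continuation bytes occupy disjoint ranges, so a valid search string cannot match starting in the middle of another character:

```php
<?php
// Assumes this file is saved as UTF-8, so the literals below are UTF-8 byte strings.
$text = "say שלום twice: שלום";

// str_replace() compares raw bytes; both search and replacement are valid UTF-8.
$result = str_replace("שלום", "和平", $text);

// preg_match() with /u returns 1 only when the subject is well-formed UTF-8.
$stillValid = preg_match('//u', $result) === 1;

echo $result, "\n";
var_dump($stillValid);
```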

zerkms · Answer 2 · 2010-10-10T23:10:17.857

7

str_replace() is safe for any ASCII-safe character.

By the way, you could also look at nl2br().
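
Note that nl2br() is not a drop-in equivalent of the str_replace() call in the question: it inserts a break tag before each line break and keeps the original newline characters. A small sketch with a made-up sample string:

```php
<?php
// Sample input (invented for this sketch) with a Windows-style line break.
$var = "first\r\nsecond";

// nl2br() inserts "<br />" before each newline and leaves the newline in place.
$withBreaks = nl2br($var);

// The str_replace() call from the question removes the newline instead.
$replaced = str_replace("\r\n", "<br>", $var);

var_dump($withBreaks === "first<br />\r\nsecond"); // bool(true)
var_dump($replaced === "first<br>second");         // bool(true)
```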

edited Oct 10 '10 at 23:10

answered Oct 10 '10 at 22:26


score 1 · Answer 3 · answered Oct 10 '10 at 23:00

1st: Use the code-sample markup for code in your questions.

2nd: Yes, it is safe.

3rd: It may not be what you want to achieve. This could be better:

$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);

Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from.
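
The order of the search array matters here. A short sketch with an invented sample string shows why "\r\n" should be listed first:

```php
<?php
// Sample input (invented) mixing Windows (\r\n), Unix (\n) and old Mac (\r) line breaks.
$var = "a\r\nb\nc\rd";

// str_replace() works through the search array in order, so "\r\n" is
// consumed as one unit before the single-character cases are tried.
$good = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);

// With "\r\n" last, its two halves are replaced separately first.
$doubled = str_replace(array("\n", "\r", "\r\n"), "<br/>", $var);

echo $good, "\n";    // a<br/>b<br/>c<br/>d
echo $doubled, "\n"; // a<br/><br/>b<br/>c<br/>d
```

With the wrong order, every Windows-style line break turns into two break tags.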
