9

If I have a UTF-8 string and want to replace line breaks with the HTML <br>, is this safe?

$var = str_replace("\r\n", "<br>", $var);

I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.

Prof. Falken
Jonny Barnes

3 Answers

18

UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character: every byte of a multi-byte sequence has a value of 128 or above. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.

And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
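A minimal sketch of that guarantee, assuming PHP 7.0 or later (for the `\u{...}` escapes) and a made-up sample string that is not from the question. It counts the high-bit bytes in two multi-byte characters, runs the replacement, and uses `preg_match` with the `/u` modifier as one possible validity check:

```php
<?php
// Hypothetical sample text: "Grüße", "日本語" and "end" separated by CRLF.
// The \u{...} escapes make the UTF-8 bytes explicit regardless of file encoding.
$var = "Gr\u{00FC}\u{00DF}e\r\n\u{65E5}\u{672C}\u{8A9E}\r\nend";

// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of them
// can be mistaken for "\r" (0x0D) or "\n" (0x0A).
$highBytes = 0;
foreach (str_split("\u{00FC}\u{65E5}") as $byte) { // 2-byte + 3-byte character
    if (ord($byte) >= 0x80) {
        $highBytes++;
    }
}

// Byte-wise replacement of an ASCII sequence with an ASCII sequence.
$out = str_replace("\r\n", "<br>", $var);

// preg_match() with /u fails on malformed UTF-8, so 1 means still valid.
$stillValid = preg_match('//u', $out) === 1;

echo $out, PHP_EOL;
```

Here all five bytes of the two sample characters are counted as high-bit bytes, and the output remains valid UTF-8 with both line breaks replaced.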

jalf
  • Only try to replace ASCII characters with ASCII characters. – Potatoswatter Oct 10 '10 at 23:21
  • +1 for the proper explanation. I think others tried to say the same thing, but this is the only answer that made perfect sense to me. – Sasha Chedygov Oct 11 '10 at 00:28
  • @Potatoswatter: oops yeah, that's a pretty important point. Edited. ;) – jalf Oct 11 '10 at 11:17
  • Actually, it's even safer than that. UTF-8 is designed so that you can do byte-only-aware text replacements on UTF-8 in a safe and valid way, as long as all the inputs and outputs are valid UTF-8 strings! And you will NEVER get an error. Not only can he use str_replace for ASCII replacements in a UTF-8 string, he can also use str_replace for UTF-8 replacements in a UTF-8 string. –  Jul 17 '11 at 14:04
  • UTF-8 is indeed great! They even thought about byte-only-aware processing! Thanks for the insight. – Omar Al-Ithawi Jan 11 '12 at 11:19
  • Actually the sequence-by-sequence replace is safe as long as the sequences are valid UTF-8 sequences: even if you replace "שלום" by "和平" as _byte sequences_ it is safe by UTF-8 design – Artyom May 02 '12 at 09:04
  • @Artyom: well, yes, but that's true for every sane encoding. Given three strings S, S0 and S1, where S0 is a substring of S, as long as all three are valid in encoding E, substituting S0 for S1 in S will *also* yield a string that's valid in encoding E. I don't see what's so special about that. The interesting thing about UTF8 is the guarantees it offers for *other* encodings (ASCII, in this case) – jalf May 02 '12 at 09:16
7

str_replace() is safe for any ASCII-safe character.

Btw, you could also look at nl2br().
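A short sketch of how nl2br() differs from the str_replace() call in the question, using a made-up input string: it inserts the break tag before each line break and leaves the original line break in place, and by default it emits the XHTML form.

```php
<?php
// Hypothetical input with a Windows-style line break.
$var = "first\r\nsecond";

// Default: XHTML-style tag inserted before the line break, which is kept.
$html = nl2br($var);              // "first<br />\r\nsecond"

// Second argument false: plain HTML tag instead.
$htmlPlain = nl2br($var, false);  // "first<br>\r\nsecond"

echo $html, PHP_EOL, $htmlPlain, PHP_EOL;
```

If the line breaks must be removed rather than kept, the str_replace() approach is the closer fit.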

zerkms
1

1st: Use the code-sample markup for code in your questions.

2nd: Yes, it is safe.

3rd: It may not be what you want to achieve. This could be better:

$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);

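Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from. Note that "\r\n" has to come first in the array, because str_replace processes the search strings in order.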
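To illustrate why the order of the search array matters, here is a small sketch with a made-up string containing all three line-break styles. str_replace() applies the search strings one after another, so putting "\r\n" last lets "\n" and "\r" each be replaced separately first:

```php
<?php
// Hypothetical input: CRLF (Windows), LF (Unix) and a lone CR (old Mac).
$var = "a\r\nb\nc\rd";

// "\r\n" first: each line break becomes exactly one tag.
$good = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
// "a<br/>b<br/>c<br/>d"

// "\r\n" last: the CRLF pair has already been split into two replacements.
$bad = str_replace(array("\n", "\r", "\r\n"), "<br/>", $var);
// "a<br/><br/>b<br/>c<br/>d"

echo $good, PHP_EOL, $bad, PHP_EOL;
```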

b_i_d