PHP unexpected output when replacing character code 8217 in string

Question

I'm running into an unexpected character replacement problem. The character is ’ (code point 8217, i.e. U+2019).

I've tried escaping the character with a backslash, but it didn't make a difference.

php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution’</a> </li>');
php > echo($a);
// => letter">Evolution/a> </li>

// Just to show that it works if the character is different
php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution"</a> </li>');
php > echo($a);
letter">Evolution</a>" </li>

I would expect it to output

letter">Evolution</a>’ </li>

instead of

letter">Evolution/a> </li>

Looks like an encoding issue. This may help: https://stackoverflow.com/questions/19629893/does-preg-replace-change-my-character-set — Bananaapple, Aug 30 '19 at 14:59

Casimir et Hippolyte · Accepted Answer · 2019-08-30T16:02:39.093

By default, PCRE (the PHP regex engine) treats your pattern as a sequence of single-byte characters. So when you write [’], you get a character class containing the three bytes that encode RIGHT SINGLE QUOTATION MARK (U+2019) in UTF-8, i.e.: \xE2, \x80, \x99.

In other words, writing "/[’]/" in this default mode is like writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/" etc.: the regex engine doesn't see a sequence of bytes that represents the character ’, only three independent bytes.
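A minimal sketch of that byte-level view, assuming a stock PHP CLI. The variable names are made up for illustration, and the quote is spelled as explicit bytes so the result does not depend on how the script file is saved:

```php
<?php
// U+2019 as its three UTF-8 bytes.
$quote = "\xE2\x80\x99";

$byteLength = strlen($quote);   // strlen() counts bytes, not characters
$hex        = bin2hex($quote);  // hexadecimal dump of those bytes

// Default (byte) mode: the class holds three separate bytes,
// so every byte of the subject matches on its own.
$byteModeMatches = preg_match_all('/[' . $quote . ']/', $quote);

// UTF-8 mode: the class holds one character, so there is one match.
$utf8ModeMatches = preg_match_all('/[' . $quote . ']/u', $quote);

// The order of the bytes inside the class is irrelevant in byte mode:
// a lone \x80 byte is matched by the shuffled class.
$loneByteMatches = preg_match('/^[\x99\xE2\x80]$/', "\x80");

printf("%d bytes (%s), byte mode: %d matches, UTF-8 mode: %d match\n",
    $byteLength, $hex, $byteModeMatches, $utf8ModeMatches);
// prints: 3 bytes (e28099), byte mode: 3 matches, UTF-8 mode: 1 match
```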

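This is why you get a strange result: [.,\'"’:?!] matches only the last byte of ’, that is \x99. Only that byte is moved after </a>, and the first two bytes (\xE2\x80) are left behind as an incomplete UTF-8 sequence.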
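To make the damage visible, here is a small sketch that reproduces the question's call in byte mode and inspects the result. The names are illustrative, and the `preg_match('//u', ...)` validity check is just one convenient idiom, not the only way to test for valid UTF-8:

```php
<?php
$quote   = "\xE2\x80\x99"; // U+2019 as UTF-8 bytes
$subject = 'letter">Evolution' . $quote . '</a> </li>';

// Same pattern as in the question, without the u modifier.
$pattern = '/([.,\'"' . $quote . ':?!])<\/a>/';
$broken  = preg_replace($pattern, '</a>\1', $subject);

// Only the last byte (\x99) was captured and moved after </a>.
$expectedBytes = 'letter">Evolution' . "\xE2\x80" . '</a>' . "\x99" . ' </li>';

// An empty pattern with the u modifier fails on invalid UTF-8 subjects.
$isValidUtf8 = preg_match('//u', $broken) !== false;

var_dump($broken === $expectedBytes); // bool(true)
var_dump($isValidUtf8);               // bool(false)
```

How a terminal renders those stray bytes varies, which is probably why the output in the question looks like characters went missing.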

To solve the problem, you have to force the regex engine to read your pattern as a UTF-8 encoded string. You can do that in one of these ways:

preg_replace('~(*UTF)([.,\'"’:?!])</a>~', '</a>\1', 'letter">Evolution’</a> </li>');
preg_replace('~([.,\'"’:?!])</a>~u', '</a>\1', 'letter">Evolution’</a> </li>');

This time the three bytes \xE2\x80\x99 are seen as an atomic sequence for the character ’.
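As a quick check, the following sketch runs both variants against the question's input and compares them with the expected string. It assumes a PHP build whose PCRE library understands the (*UTF) verb (PCRE2, as bundled with recent PHP versions, does); the variable names are illustrative:

```php
<?php
$quote    = "\xE2\x80\x99"; // U+2019 as UTF-8 bytes
$subject  = 'letter">Evolution' . $quote . '</a> </li>';
$expected = 'letter">Evolution</a>' . $quote . ' </li>';

// Variant 1: the u modifier.
$withModifier = preg_replace('~([.,\'"' . $quote . ':?!])</a>~u', '</a>\1', $subject);

// Variant 2: the (*UTF) verb at the very start of the pattern.
$withVerb = preg_replace('~(*UTF)([.,\'"' . $quote . ':?!])</a>~', '</a>\1', $subject);

var_dump($withModifier === $expected); // bool(true)
var_dump($withVerb === $expected);     // bool(true)
```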

Note: (*UTF) only affects how the pattern is read, whereas the u modifier does more: it extends shorthand character classes (such as \s, \w, \d) to Unicode characters and checks that the subject string is valid UTF-8.
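The two extra effects of the u modifier can be observed with a short sketch like this one. It assumes a PHP build where u also turns on Unicode character properties, which is the case for current releases; the sample strings are arbitrary:

```php
<?php
$eAcute = "\xC3\xA9"; // é (U+00E9) as UTF-8 bytes

// Byte mode: the subject is two bytes, so a single \w cannot span it.
$wordWithoutU = preg_match('/^\w$/', $eAcute);

// UTF-8 mode: é is one character and counts as a word character.
$wordWithU = preg_match('/^\w$/u', $eAcute);

// A subject that is not valid UTF-8 (\xFF never occurs in UTF-8).
$invalid = "abc\xFF";

// With u, the subject is validated first and the call fails.
$resultWithU    = preg_match('/abc/u', $invalid);
$errorIsBadUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;

// Without u, the same subject is matched as plain bytes.
$resultWithoutU = preg_match('/abc/', $invalid);

var_dump($wordWithoutU, $wordWithU);     // int(0) int(1)
var_dump($resultWithU, $errorIsBadUtf8); // bool(false) bool(true)
var_dump($resultWithoutU);               // int(1)
```

Because the failure is reported as a false return value rather than an exception, it is worth checking the result of preg_* calls that use u on untrusted input.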

> In other words, writing "/[’]/" in this default mode is like writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/" etc., the regex engine doesn't see a sequence of bytes that represents the character ’ but only three bytes.
> This is the reason why you obtain a strange result, because [.,\'"’:?!] will only match the last byte of ’ so \x99.
Ahh, awesome, really appreciate the explanation of why it was failing :) — David da Silva, Aug 30 '19 at 15:59

score 2 · Answer 2 · answered Aug 30 '19 at 15:43

Just add the Unicode flag (u) to the regex:

$a = preg_replace('/([.,\'"’:?!])<\/a>/u', '</a>\1', 'letter">Evolution’</a> </li>');
#                              here ___^
echo($a);

Toto
