I'm running into an unexpected character replacement problem. The character is ’ (character code 8217, U+2019).

I've tried escaping the character with a backslash, but it didn't make a difference.

php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution’</a> </li>');
php > echo($a);
// => letter">Evolution/a> </li>

// Just to show that it works if the character is different
php > $a = preg_replace('/([.,\'"’:?!])<\/a>/', '</a>\1', 'letter">Evolution"</a> </li>');
php > echo($a);
letter">Evolution</a>" </li>

I would expect it to output

letter">Evolution</a>’ </li>

instead of

letter">Evolution/a> </li>

  • 2
    Looks like an encoding issue. This may help: https://stackoverflow.com/questions/19629893/does-preg-replace-change-my-character-set – Bananaapple Aug 30 '19 at 14:59

2 Answers

2

By default, PCRE (the PHP regex engine) treats your pattern as a sequence of single-byte characters. So when you write [’] you get a character class containing the three bytes that encode RIGHT SINGLE QUOTATION MARK (U+2019) in UTF-8, i.e. \xE2, \x80 and \x99.

In other words, writing "/[’]/" in this default mode is like writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/" etc. The regex engine doesn't see a sequence of bytes that represents the character ’, only three independent bytes.

This is why you get a strange result: [.,\'"’:?!] matches a single byte, so it only matches the last byte of ’, which is \x99. That byte is moved after </a>, and \xE2\x80 is left behind as an incomplete sequence.
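The byte-level behaviour described above can be checked directly. This is only a minimal sketch, assuming PHP 7 or later (for the "\u{2019}" escape); the variable names $quote, $subject, $pattern, $broken and $fixed are illustrative and not part of the original question.

```php
<?php
// Build the character from its code point so the example does not depend
// on the encoding of this source file.
$quote   = "\u{2019}";   // RIGHT SINGLE QUOTATION MARK
$subject = 'letter">Evolution' . $quote . '</a> </li>';
$pattern = '/([.,\'"' . $quote . ':?!])<\/a>/';

// The character is three bytes long in UTF-8.
echo bin2hex($quote), "\n";   // e28099

// Byte mode: the class matches one byte, so only \x99 is captured and moved.
$broken = preg_replace($pattern, '</a>\1', $subject);
echo bin2hex($broken), "\n";  // hex dump shows e280 before </a> and 99 after it

// The result is no longer valid UTF-8: an empty /u pattern fails on it.
var_dump(preg_match('//u', $broken));   // bool(false)

// UTF-8 mode: the three bytes are treated as one character.
$fixed = preg_replace($pattern . 'u', '</a>\1', $subject);
echo $fixed, "\n";
```

The hex dump makes the split visible even when the terminal hides or mangles the stray bytes.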

To solve the problem, you have to force the regex engine to read your pattern as a UTF-8 encoded string. You can do that in one of these ways:

  • preg_replace('~(*UTF)([.,\'"’:?!])</a>~', '</a>\1', 'letter">Evolution’</a> </li>');
  • preg_replace('~([.,\'"’:?!])</a>~u', '</a>\1', 'letter">Evolution’</a> </li>');

This time the three bytes \xE2\x80\x99 are seen as an atomic sequence for the character ’.

Note: (*UTF) only affects how the pattern is read, whereas the u modifier does more: it extends shorthand character classes (like \s, \w, \d) to Unicode characters and checks that the subject string is valid UTF-8.
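The two extra effects of the u modifier mentioned above can be observed with a short script. This is a sketch under assumptions: PHP 7 or later, and the sample strings $word and $invalid are made up for illustration.

```php
<?php
$word    = "caf\u{00E9}";   // "café"; é is encoded as the two bytes C3 A9
$invalid = "\xE2\x80";      // truncated UTF-8 sequence (first two bytes of U+2019)

// With u, \w is Unicode-aware, so the accented letter counts as a word character.
$unicodeWord = preg_match('/^\w+$/u', $word);
var_dump($unicodeWord);     // int(1)

// Without u, the result depends on byte-level (locale) tables;
// in the default "C" locale the bytes C3 A9 are typically not word characters.
var_dump(preg_match('/^\w+$/', $word));

// With u, the subject is validated first: invalid UTF-8 makes the call fail.
$checked = preg_match('/./u', $invalid);
$error   = preg_last_error();
var_dump($checked);                         // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR);   // bool(true)

// Without u, the same subject is just two bytes and matches normally.
$unchecked = preg_match('/./', $invalid);
var_dump($unchecked);       // int(1)
```

A false return value combined with PREG_BAD_UTF8_ERROR is a quick way to tell an encoding problem apart from a pattern that simply did not match.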

Casimir et Hippolyte
  • 1
    > In other words, writing "/[’]/" in this default mode is like writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/" etc., the regex engine doesn't see a sequence of bytes that represents the character ’ but only three bytes. > This is the reason why you obtain a strange result, because [.,\'"’:?!] will only match the last byte of ’ so \x99. Ahh, awesome, really appreciate the explanation of why it was failing :) – David da Silva Aug 30 '19 at 15:59
2

Just add the Unicode flag (u) to the regex:

$a = preg_replace('/([.,\'"’:?!])<\/a>/u', '</a>\1', 'letter">Evolution’</a> </li>');
#                              here ___^
echo($a); 
Toto