By default, PCRE (the regex engine used by PHP) treats your pattern as a sequence of single-byte characters. So when you write [’] you get a character class containing the three bytes that encode RIGHT SINGLE QUOTATION MARK (U+2019) in UTF-8, i.e. \xE2, \x80, \x99.
In other words, writing "/[’]/" in this default mode is the same as writing "/[\xE2\x80\x99]/" or "/[\x80\xE2\x99]/" or "/[\x99\xE2\x80]/", etc. The regex engine doesn't see a byte sequence that represents the character ’, only three separate bytes.
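This is why you get a strange result: in [.,\'"’:?!] each of the three bytes is a separate member of the class, and the class matches one byte at a time. The only byte of ’ that is directly followed by </a> is the last one, \x99, so only that byte is captured and moved, which splits the character in two.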
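A minimal sketch of that behaviour, reusing the pattern and subject from the question. The variable names are only illustrative, and U+2019 is spelled out as its three UTF-8 bytes so that the result doesn't depend on the encoding of the source file:

```php
<?php
// U+2019 written as its three UTF-8 bytes.
$quote   = "\xE2\x80\x99";
$subject = 'letter">Evolution' . $quote . '</a> </li>';

// No u modifier: the pattern is read byte by byte.
$pattern = '~([.,\'"' . $quote . ':?!])</a>~';

preg_match($pattern, $subject, $m);
echo bin2hex($m[1]), "\n";   // 99 (only the last byte is captured)

// The replacement moves that single byte behind </a>,
// leaving \xE2\x80 in front of it: the character is broken.
$result = preg_replace($pattern, '</a>\1', $subject);
echo bin2hex($result), "\n";
```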
To solve the problem, you have to force the regex engine to read your pattern as a UTF-8 encoded string. You can do that in one of these ways:
preg_replace('~(*UTF)([.,\'"’:?!])</a>~', '</a>\1', 'letter">Evolution’</a> </li>');
preg_replace('~([.,\'"’:?!])</a>~u', '</a>\1', 'letter">Evolution’</a> </li>');
This time the three bytes \xE2\x80\x99 are treated as a single unit that represents the character ’.
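To check this, here is a small sketch with the same subject. It assumes a PHP build whose PCRE version accepts the (*UTF) verb (older versions only know (*UTF8)); the variable names are illustrative:

```php
<?php
$quote   = "\xE2\x80\x99";   // U+2019 as UTF-8 bytes
$subject = 'letter">Evolution' . $quote . '</a> </li>';
$class   = '[.,\'"' . $quote . ':?!]';

// Variant 1: the u modifier.
$withModifier = preg_replace('~(' . $class . ')</a>~u', '</a>\1', $subject);

// Variant 2: the (*UTF) verb at the very start of the pattern.
$withVerb = preg_replace('~(*UTF)(' . $class . ')</a>~', '</a>\1', $subject);

// Both should move the whole three-byte character behind </a>.
echo $withModifier, "\n";
echo $withVerb, "\n";
```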
Note: (*UTF) is only for the reading of the pattern, but the u modifier does more: it extends the shorthand character classes (like \s, \w, \d) to Unicode characters and checks that the subject string is valid UTF-8.
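Both effects of the u modifier can be observed with a short sketch. The test strings are made up for the illustration, and the byte-mode result for \w assumes the default "C" locale:

```php
<?php
$word = "\xC3\xA9t\xC3\xA9";   // "été" as UTF-8 bytes

// Without u, \w is limited to ASCII word characters (in the "C" locale),
// so the accented letters are not matched.
$byteMode = preg_match('/^\w+$/', $word);

// With u, \w also covers Unicode letters.
$utf8Mode = preg_match('/^\w+$/u', $word);

// A lone continuation byte is not valid UTF-8.
$invalid = "\x99";

// With u, the subject is validated first: the call fails with an error.
$checked = preg_match('/./u', $invalid);
$error   = preg_last_error();

// Without u, the same byte is just an ordinary character.
$unchecked = preg_match('/./', $invalid);

var_dump($byteMode, $utf8Mode, $checked, $error === PREG_BAD_UTF8_ERROR, $unchecked);
```

So with the u modifier, a failure of preg_match or preg_replace on badly encoded input shows up as a false or null return value, and preg_last_error() tells you why.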