I'm using preg_match in PHP and found some odd behaviour, shown below (it appears to be consistent across PHP versions):

var_dump(preg_match('/[£]/', '«')); // int(1), i.e. a match
var_dump(preg_match('/£/', '«')); // int(0), i.e. no match

var_dump(preg_match('/[»]/', '«')); // int(1), i.e. a match
var_dump(preg_match('/»/', '«')); // int(0), i.e. no match

I would expect none of these expressions to match. However, when the character is wrapped in square brackets (a character class, meaning "match any character in this set"), preg_match reports a match. I checked beforehand that multibyte strings were supported and was told they were, but I may be mistaken. I would normally use the mb_ereg functions, but I could not find an equivalent of preg_replace_callback, which is what I want to use. I've found a workaround, so this isn't blocking me, but I'd like to know what's going on here because it seems like really weird behaviour!

Henry Howeson
  • 1
    You have to add the unicode flag for tests like these, iirc. As in `'/[£]/u'`. I remember multibyte issues used to be a nightmare in PHP; haven't done that in ages. ^^ – oriberu Mar 21 '20 at 19:02
  • @oriberu Thank you! Problem solved (: I feel a bit stupid now, I think the inconsistent behaviour with the square brackets threw me off a bit! Feel free to make an answer and I'll accept it – Henry Howeson Mar 21 '20 at 19:06
  • 1
    Great. :) You wouldn't believe how much I bungled up when web sites and servers began switching to unicode in the late 90s. In the beginning I just didn't know why the bleep things wouldn't work. So, I'm sympathetic ^^ – oriberu Mar 21 '20 at 19:16

1 Answer

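You have to add the u (UTF-8) modifier for tests like these, i.e. `'/[£]/u'`. Without it, the pattern and subject are compared byte by byte, so `[£]` becomes a class containing the two bytes of £ (0xC2 and 0xA3), and 0xC2 is also the first byte of «. The unbracketed `/£/` requires both bytes in sequence, which is why it does not match.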
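
To check the byte-level behaviour directly, here is a minimal sketch. It assumes PHP 7.0 or later, because it uses `"\u{...}"` escapes so that the result does not depend on how the source file is encoded; the variable names are only illustrative.

```php
<?php
// Build the three characters from their code points (UTF-8 encoded strings).
$pound = "\u{00A3}"; // £
$laquo = "\u{00AB}"; // «
$raquo = "\u{00BB}"; // »

// All three share the lead byte 0xC2.
echo bin2hex($pound), ' ', bin2hex($laquo), ' ', bin2hex($raquo), "\n"; // c2a3 c2ab c2bb

// Byte mode: the class holds the bytes 0xC2 and 0xA3, so the 0xC2 in « matches.
$byteMode = preg_match("/[{$pound}]/", $laquo);
// UTF-8 mode: the class holds the single character £, so « does not match.
$utf8Mode = preg_match("/[{$pound}]/u", $laquo);
var_dump($byteMode); // int(1)
var_dump($utf8Mode); // int(0)

// The same modifier works with preg_replace_callback.
$wrapped = preg_replace_callback(
    "/[{$pound}{$raquo}]/u",
    function (array $m) {
        return '[' . $m[0] . ']';
    },
    $laquo . $pound . $raquo
);
echo $wrapped, "\n"; // «[£][»]
```

Since preg_replace_callback accepts the same modifiers as preg_match, adding `u` should remove the need for an mb_ereg-based workaround here.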

From the PHP docs:

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
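
The "invalid subject" case from the quote is worth handling explicitly, because with `u` a failed match and a rejected subject look similar at first glance. A small sketch, using a deliberately broken one-byte string as an assumed example input:

```php
<?php
// A lone lead byte 0xC2 with no continuation byte is not valid UTF-8.
$invalid = "\xC2";

// With /u, preg_match returns false (an error), not 0 (no match).
$result = preg_match('/./u', $invalid);
var_dump($result); // bool(false)

// preg_last_error() reports why the call failed.
$isBadUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;
var_dump($isBadUtf8); // bool(true)
```

Comparing the return value with `=== false` rather than a loose truthiness check keeps "no match" and "bad input" apart.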

oriberu