PHP preg_ /u utf-8 switch - Not understanding what it does in practice

Question

I am converting a PHP/MariaDB web application from latin1 to UTF-8. I have it working, but I am not using the /u switch on any of my preg_ statements and it seems to be working fine. I have tried samples of Russian, Traditional and Simplified Chinese, Japanese, Arabic and Hindi. Part of the application is a wiki which uses preg_ statements extensively, and it works fine too.

So what is the preg /u switch supposed to do, since everything seems to work fine without it?

I have been looking up information on this for 2 weeks and I can't find anything that explains the /u switch in a way that differentiates its use from 'not' using it.

I have determined that the PCRE library my PHP is using does have the UTF-8 features compiled in. I'm using PHP v5.6.20 and MariaDB 5.5.32. My web pages, MySQL driver and MariaDB are all using UTF-8.

score 0 · Accepted Answer · answered May 06 '16 at 18:47

The u modifier tells PCRE to treat the pattern and the subject as UTF-8 rather than as a sequence of single bytes, which changes how certain matching cases are handled. For example, with u the dot metacharacter is permitted to consume multiple bytes, provided they form a valid UTF-8 sequence:

preg_match('/^.$/', '老');  // 0
preg_match('/^.$/u', '老'); // 1

Another example is what a character class covers:

preg_match('/^[[:print:]]$/', '老'); // 0
preg_match('/^[[:print:]]$/u', '老'); // 1
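
To see why the unmodified patterns fail, it can help to count what the dot actually consumes. This is only a minimal sketch, assuming the source file is saved as UTF-8; the helper name count_dot_matches is made up for illustration and is not part of PHP:

```php
<?php
// Hypothetical helper: counts how many times '.' matches in $subject.
function count_dot_matches($subject, $utf8)
{
    $pattern = $utf8 ? '/./u' : '/./';
    return preg_match_all($pattern, $subject);
}

// '老' is U+8001, which UTF-8 encodes as the three bytes E8 80 81.
echo strlen('老'), "\n";                    // 3 (bytes)
echo count_dot_matches('老', false), "\n";  // 3 (dot consumes one byte at a time)
echo count_dot_matches('老', true), "\n";   // 1 (dot consumes one code point)
```

Without u, the anchored pattern /^.$/ therefore has three "characters" to get through and cannot match.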

When UTF-8 text (or indeed a string in any other encoding) is included directly in the regex as a plain literal, the u modifier effectively makes no difference, as PCRE is ultimately going to match byte by byte.
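
The literal case can be checked the same way. In the sketch below (again assuming UTF-8 source; pattern_matches is a hypothetical helper, not a PHP function), a plain literal matches with or without u. One caveat worth noting: as soon as a quantifier is attached to a multi-byte literal, u matters again, because without it the quantifier applies only to the last byte:

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function pattern_matches($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

// A plain literal is compared byte for byte, so u changes nothing here.
var_dump(pattern_matches('/老/', 'x老y'));   // bool(true)
var_dump(pattern_matches('/老/u', 'x老y'));  // bool(true)

// A quantifier binds to the last byte of the literal unless u is set.
var_dump(pattern_matches('/^老+$/', '老老'));   // bool(false)
var_dump(pattern_matches('/^老+$/u', '老老'));  // bool(true)
```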

Oh! Ok. I understand now. It affects the metacharacter matching! That's the distinction I wasn't catching on to. Makes sense now since the other modifiers (gsi) also affect how the metacharacter matching works. Thanks for the clarification. You made my day! :-) — rick zee, May 06 '16 at 19:04
