There are no multibyte 'preg' functions available in PHP, so does that mean the default preg_ functions are all multibyte safe? I couldn't find any mention of this in the PHP documentation.
I'm 90% sure the underlying C functions are, but that doesn't mean the PHP versions are, I suppose... – Matthew Scharley Nov 19 '09 at 21:00
5 Answers
PCRE supports UTF-8 out of the box; see the documentation for the 'u' pattern modifier.
Illustration (\xC3\xA4 is the UTF-8 encoding of the German letter "ä"):
echo preg_replace('~\w~', '@', "a\xC3\xA4b");
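Without the modifier the subject is matched byte by byte, so "\xC3" and "\xA4" are treated as distinct symbols. In a Latin-1 locale this echoes "@@¤@"; in the default "C" locale neither byte matches \w, so both pass through unchanged.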
echo preg_replace('~\w~u', '@', "a\xC3\xA4b");
(note the 'u') prints "@@@" because "\xC3\xA4" is treated as a single letter.
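To make the byte-versus-character difference easier to see, here is a minimal sketch that counts what `.` matches with and without the modifier. The variable names are illustrative only and are not part of the original answer.

```php
<?php
// "a", "ä" (two bytes in UTF-8: C3 A4), "b"
$s = "a\xC3\xA4b";

// Without 'u' the subject is scanned byte by byte.
$bytes = preg_match_all('~.~', $s, $m1);

// With 'u' the subject is decoded as UTF-8 before matching.
$chars = preg_match_all('~.~u', $s, $m2);

echo $bytes, "\n"; // 4
echo $chars, "\n"; // 3

// The \w replacement from the answer, with the modifier set.
$replaced = preg_replace('~\w~u', '@', $s);
echo $replaced, "\n"; // @@@
```

The count of 4 versus 3 is the same effect that makes the unmodified `\w` replacement split "ä" in two.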

Really? Hmm, I'm not overly proficient with regex strings, if you don't mind I might post some of my preg_ code to see what you think? – Spoonface Nov 19 '09 at 22:08
I was getting an error when `json_encode`ing a string after calling `preg_replace`: without the modifier, `preg_replace` had converted some UTF-8 characters to the *replacement character*. The `u` modifier saved my day! Thanks a lot for that. – Jay Dadhania Dec 05 '19 at 16:02
PCRE can support UTF-8 and other Unicode encodings, but it has to be specified at compile time. From the man page for PCRE 8.0:
The current implementation of PCRE corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.
PHP currently uses PCRE 7.9; your system might have an older version.
Taking a look at the PCRE lib that comes with PHP 5.2, it appears that it's configured to support Unicode properties and UTF-8. Same for the 5.3 branch.
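Rather than reading config.h, the same facts can usually be checked from PHP itself. The sketch below relies on the predefined `PCRE_VERSION` constant and on probing with small patterns; the probe patterns and variable names are assumptions chosen for illustration, not an official capability API.

```php
<?php
// PCRE_VERSION is a predefined constant holding the library version and date.
echo 'PCRE version: ', PCRE_VERSION, "\n";

// Probe UTF-8 mode: a /u pattern fails to compile (preg_match returns
// false) when the library was built without UTF-8 support.
$utf8 = @preg_match('/^.$/u', "\xC3\xA4") === 1;

// Probe Unicode property support (\p{L} = any letter).
$props = @preg_match('/^\p{L}$/u', "\xC3\xA4") === 1;

echo 'UTF-8 mode: ', $utf8 ? 'yes' : 'no', "\n";
echo 'Unicode properties: ', $props ? 'yes' : 'no', "\n";
```

The PCRE section of `phpinfo()` (or `php -i` on the command line) reports the same version string.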

I'm using PHP 5.3.0, which includes PCRE version 7.9. I checked the PCRE config.h file, which includes the UTF8 definition, so it looks like the preg_ functions are safe. Thanks very much for the info! – Spoonface Nov 19 '09 at 21:50
Is there a quick way to determine which version of PCRE an existing PHP installation is using? My server for instance is running PHP 5.5, but how can I tell what PCRE library it was compiled with? – thatidiotguy Feb 27 '17 at 18:02
As a note to anyone using PREG_OFFSET_CAPTURE: the offset is in *bytes*, so you'll want to use substr rather than mb_substr and the like. – Consti P Dec 05 '22 at 12:02
No, they are not. See the question preg_match and UTF-8 in PHP for example.
To clarify, the `PREG_OFFSET_CAPTURE` flag produces byte offsets rather than character offsets. That is consistent with string handling in PHP, but it can be pretty confusing. – Álvaro González Oct 02 '13 at 16:23
If you use [T-Regx tool](https://t-regx.com), you can use `offset()` or `byteOffset()` methods to get offsets in characters or bytes. – Danon Jan 28 '19 at 18:13
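The byte offsets mentioned in the comments above can also be converted to character offsets without an extra library. This is a minimal sketch, assuming the mbstring extension is loaded and the subject is valid UTF-8; the sample string and variable names are illustrative.

```php
<?php
// Assumes the mbstring extension is loaded and $s is valid UTF-8.
$s = "\xC3\xA4\xC3\xB6 foo"; // "äö foo"

preg_match('/foo/u', $s, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1]; // "ä" and "ö" take two bytes each, plus one space

// Count the characters contained in the bytes that precede the match.
$charOffset = mb_strlen(substr($s, 0, $byteOffset), 'UTF-8');

echo $byteOffset, "\n"; // 5
echo $charOffset, "\n"; // 3

// Byte offsets pair with substr(), character offsets with mb_substr().
echo substr($s, $byteOffset, 3), "\n";             // foo
echo mb_substr($s, $charOffset, 3, 'UTF-8'), "\n"; // foo
```

Mixing the two (a byte offset passed to `mb_substr`, or the reverse) is what produces the off-by-a-few results people run into.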
No, you need to use the multibyte string functions like mb_ereg

They're the multi-byte version of the POSIX `ereg` functions, though, which aren't exactly the same as the PCRE `preg` functions. – mercator Nov 19 '09 at 21:28
Ben S you are my hero :) I just wanted to purify texts and leave äöüß within the text. preg_replace never did this properly, but mb_ereg does! – Nibbels Apr 19 '17 at 16:18
As long as you use the /u modifier, THEY ARE MULTIBYTE SAFE, provided that the multibyte encoding is UTF-8. The /u engine doesn't support any encoding other than UTF-8. – hanshenrik Jul 07 '17 at 14:26
`preg_match` with `/u` modifier works a treat! thank you @hanshenrik – Matt Sephton Nov 07 '21 at 14:04
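The UTF-8-only restriction noted in the comments can be seen directly: with `/u`, a subject in another encoding makes the call fail instead of matching. The sketch below assumes the mbstring extension is loaded and that the input is known to be ISO-8859-1; the sample strings are illustrative.

```php
<?php
// Assumes the mbstring extension is loaded.
$latin1 = "gr\xFC\xDF"; // "grüß" in ISO-8859-1, which is not valid UTF-8

// With /u, an invalid UTF-8 subject makes preg_match return false.
$result = @preg_match('/\w+/u', $latin1);
var_dump($result === false);                         // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Convert to UTF-8 first, then match.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(preg_match('/^\w+$/u', $utf8));             // int(1)

// Strip everything except letters, digits and whitespace, keeping äöüß.
$clean = preg_replace('/[^\p{L}\p{N}\s]+/u', '', "Gr\xC3\xBC\xC3\x9Fe, Welt!");
echo $clean, "\n"; // Grüße Welt
```

Checking for a `false` return (as opposed to `0`) is what distinguishes "no match" from "the subject was not valid UTF-8".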
Some of my more complicated preg functions:
(1a) validate username as alphanumeric + underscore:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$username)
(1b) possible UTF alternative:
preg_match('/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u',$username)
(2a) validate email:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ix",$email)
(2b) possible UTF alternative:
preg_match("/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*@([a-z0-9\-]+\.)+[a-z]{2,6}$/ixu",$email)
(3a) normalize newlines:
preg_replace("/(\n){2,}/","\n\n",$str);
(3b) possible UTF alternative:
preg_replace("/(\n){2,}/u","\n\n",$str);
Do these changes look alright?
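One thing worth checking with patterns like (1b): the character classes list only ASCII ranges, so adding `u` does not widen what they accept; it only makes the subject be read as UTF-8. The sketch below compares (1b) with a hypothetical Unicode-aware variant built on `\p{L}` and `\p{N}`. That variant is an assumption for illustration, not something proposed in the post.

```php
<?php
// (1b) from the post: ASCII-only classes, so 'u' does not widen what matches.
$ascii = '/^[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/u';

// Hypothetical Unicode-aware variant: \p{L} = any letter, \p{N} = any number.
$unicode = '/^\p{L}[\p{L}\p{N}]*(?:_[\p{L}\p{N}]+)*$/u';

$name = "j\xC3\xBCrgen_42"; // "jürgen_42"

var_dump(preg_match($ascii, $name));         // int(0)
var_dump(preg_match($unicode, $name));       // int(1)
var_dump(preg_match($unicode, 'bad__name')); // int(0), double underscore rejected
```

Whether usernames should accept non-ASCII letters at all is a policy decision; the point here is only that the modifier by itself does not make that choice.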

I believe your email regular expression will allow '..' anywhere in the email address, which is something you need assertions to prevent. – Anthony Rutledge Jun 21 '16 at 15:00