Non-ASCII characters in UTF-8 mode regular expression

Question

Despite the PHP manual stating:

"In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

Why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?

Elaboration

A remark on an answer to an unrelated question mentions that, in regular expressions, \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹).

That question is tagged java, but the behavior can be observed in PHP as well. With this in mind, I wrote the following test:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);

The resulting array $capture contains a match on 5 only.

Using the u modifier to turn on "UTF-8 mode" and running this:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);

results in $capture containing matches on both ۳ and 5.
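
The two tests above use \d only, while the question also mentions [[:digit:]]. As a minimal sketch, assuming the source file is saved as UTF-8 and the default C locale is in effect, the same comparison can be run with the POSIX class. The variable names $bytewise and $utf8 are chosen here for illustration only.

```php
<?php
// Same subject string as in the tests above.
$string = 'I have ۳ apples and 5 oranges';

// Without the u modifier the subject is scanned byte by byte.
preg_match_all('/[[:digit:]]+/', $string, $bytewise);

// With the u modifier the subject is decoded as UTF-8.
preg_match_all('/[[:digit:]]+/u', $string, $utf8);

var_dump($bytewise[0]); // expected to contain only "5"
var_dump($utf8[0]);     // expected to contain "۳" and "5"
```

If the behavior described in the question holds, [[:digit:]] and \d give the same results in both modes.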

Notes

This question refers to PHP 5.6.22 (the newest release to date).
Both tests were executed while explicitly using the C locale.

The first test, without the `u` flag, is meaningless when your string is not ASCII, since matching is carried out with byte semantics. If you use `\w` with a [SHIFT-JIS](https://en.wikipedia.org/wiki/Shift_JIS#Shift_JIS_byte_map) encoded string, you may match the second byte of some character. See the example section in this answer for an explanation of non-UTF mode and its consequences: https://stackoverflow.com/questions/20954580/maximum-hex-value-in-regex/30556342#30556342 — nhahtdh, Jun 07 '16 at 08:17
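
To make the byte semantics mentioned in this comment concrete, here is a small sketch, assuming the source file is saved as UTF-8. It counts how many times "." matches a single Persian digit with and without the u modifier. The variable names are illustrative.

```php
<?php
$digit = '۳'; // U+06F3, encoded in UTF-8 as the two bytes 0xDB 0xB3

// strlen() counts bytes, so it reports 2 for this single character.
$byteLength = strlen($digit);

// Without u, "." consumes one byte at a time.
$bytewiseCount = preg_match_all('/./', $digit, $bytewise);

// With u, "." consumes one whole code point.
$utf8Count = preg_match_all('/./u', $digit, $utf8);

printf(
    "%d bytes, %d byte-wise matches, %d UTF-8 match\n",
    $byteLength,
    $bytewiseCount,
    $utf8Count
);
// prints "2 bytes, 2 byte-wise matches, 1 UTF-8 match"
```

Because the non-u pattern sees two unrelated bytes rather than one digit, it has no way to recognize the character as a digit at all.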

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

Because the documentation is broken. And it's not the only place where it is so, unfortunately.

PHP uses PCRE under the hood to implement its preg_* functions. PCRE's documentation is thus authoritative there. PHP's documentation is based on PCRE's, but it looks like you found yet another mistake.

Here's what you can read in PCRE's docs (emphasis mine):

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
[:alnum:]  becomes  \p{Xan}
[:alpha:]  becomes  \p{L}
[:blank:]  becomes  \h
[:digit:]  becomes  \p{Nd}
[:lower:]  becomes  \p{Ll}
[:space:]  becomes  \p{Xps}
[:upper:]  becomes  \p{Lu}
[:word:]   becomes  \p{Xwd}

If you dig further in PHP's docs, you'll find the following:

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

This is, unfortunately, a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag is the one that changes the meaning of \d, \w and the like, as you can see from the docs above. Your tests confirm that.
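
A short sketch of what this means in practice, assuming a PHP build whose PCRE library has Unicode property support and a source file saved as UTF-8: under the u modifier, \d should behave like \p{Nd}, while an explicit [0-9] range stays limited to ASCII digits. The variable names are illustrative.

```php
<?php
$string = 'I have ۳ apples and 5 oranges';

// \d under the u modifier, which per the explanation above also enables PCRE_UCP.
preg_match_all('/\d+/u', $string, $shorthand);

// The Unicode "decimal number" property, written out explicitly.
preg_match_all('/\p{Nd}+/u', $string, $property);

// An explicit ASCII range is not affected by PCRE_UCP.
preg_match_all('/[0-9]+/u', $string, $asciiOnly);

var_dump($shorthand[0] === $property[0]); // expected: bool(true)
var_dump($asciiOnly[0]);                  // expected to contain only "5"
```

When only ASCII digits are wanted in UTF-8 mode, spelling out [0-9] avoids depending on how \d is interpreted.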

As a side note, don't infer properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).

Thanks for that elaborate answer, Lucas. Using this information, I've filed a [documentation bug report](https://bugs.php.net/bug.php?id=72353). Let's see if it gets squashed or, indeed, corrected. — Linus Kleen, Jun 07 '16 at 08:42
