Short answer
In non-UTF mode, UTF-8 library
Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode \C is equivalent to [\x00-\xff] and (?s:.).
Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.
In non-UTF mode, each of them matches 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and each matches all 256 possible values.
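As a rough illustration of the equivalence above, the sketch below counts matches of each construct against a 5-byte string and then uses a look-behind in place of the offset parameter. The sample string and variable names are my own, and it assumes a PHP build whose PCRE has not been compiled with \C disabled.

```php
<?php
// "é" is encoded as 2 bytes in UTF-8 (0xC3 0xA9), so $subject is 5 bytes long.
$subject = "a\xC3\xA9bc";

// No u flag, so all three patterns run in non-UTF mode.
$counts = [];
foreach (['/\C/', '/[\x00-\xff]/', '/(?s:.)/'] as $pattern) {
    $counts[$pattern] = preg_match_all($pattern, $subject);
}
// Every count equals strlen($subject), i.e. 5.

// Look-behind standing in for an offset of 3 bytes: the match may only
// start at a position preceded by exactly 3 bytes from the string start.
preg_match('/(?<=^(?s:.){3})\w+/', $subject, $m, PREG_OFFSET_CAPTURE);
// $m[0] is ['bc', 3]
```

The look-behind has to be fixed-length, which is why the byte count is spelled out as {3} rather than computed.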
In UTF-mode, UTF-8 library
UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.
In UTF mode, a character class [] and the dot metacharacter . each match one code point that lies within the valid Unicode range and is not a surrogate. Since one code point is encoded as 1 to 4 bytes in UTF-8, and because of how the UTF-8 encoding scheme works, it is not possible to use a character class to match a single byte with a value in the range 0x80 to 0xFF.
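To make the difference concrete, here is a small sketch (sample string and variable names are mine) comparing the dot and a high-range character class with and without the u flag.

```php
<?php
$subject = "a\xC3\xA9bc"; // "aébc": 4 code points, 5 bytes

// The dot counts bytes without the u flag and code points with it.
$bytes      = preg_match_all('/./s', $subject);   // 5
$codePoints = preg_match_all('/./su', $subject);  // 4

// With the u flag, [\x80-\xff] means U+0080..U+00FF, so it consumes the
// whole 2-byte "é" (U+00E9) rather than the single byte 0xC3.
preg_match('/[\x80-\xff]/u', $subject, $utf);
preg_match('/[\x80-\xff]/', $subject, $raw);
// strlen($utf[0]) is 2, strlen($raw[0]) is 1
```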
While \C is specifically designed to match one data unit (which is one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a look-behind in UTF mode.
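The restriction is easy to observe. The sketch below assumes a PHP build where \C itself is enabled. The @ operator is only there to silence the compile warning for the demonstration.

```php
<?php
// Non-UTF mode: \C is accepted inside a look-behind.
$nonUtf = @preg_match('/(?<=\C)a/', 'ba');   // 1

// UTF mode: the same pattern fails to compile, so preg_match() returns false.
$utf = @preg_match('/(?<=\C)a/u', 'ba');     // false
```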
UTF-16 and UTF-32 library
I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, links it into PHP, and makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I have no clue how the string and the offset would be passed from PHP to the C API of PCRE in that case, and the result of the preg_* functions may differ depending on that.
More details
At the C API level of the PCRE library, you can only work with data units: 8-bit units for the 8-bit library, 16-bit units for the 16-bit library and 32-bit units for the 32-bit library.
For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little barrier to specifying an offset in bytes, whether as a function parameter or as a regex construct.
Regex constructs
In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.
\C matches 1 data unit regardless of whether UTF mode is on. It can't be used in a look-behind in UTF mode, though.
MATCHING A SINGLE DATA UNIT
Outside a character class, the escape sequence \C matches any one data
unit, whether or not a UTF mode is set.
The dot . matches 1 data unit in non-UTF mode.
General comments about UTF modes
[...]
- The dot metacharacter matches one UTF character instead of a single
data unit.
A character class matches 1 data unit in non-UTF mode. The documentation doesn't state this explicitly, but it's implied by the wording.
SQUARE BRACKETS AND CHARACTER CLASSES
[...]
A character class matches a single character in the subject. In a UTF
mode, the character may be more than one data unit long.
The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode. In my testing, the last clause about surrogates doesn't seem to apply in non-UTF mode.
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
  8-bit non-UTF mode    less than 0x100
  8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
  16-bit non-UTF mode   less than 0x10000
  16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
  32-bit non-UTF mode   less than 0x100000000
  32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
so-called "surrogate" codepoints), and 0xffef.
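For the 8-bit library that PHP uses, these limits can be checked directly. The following is a minimal sketch with my own sample values. A pattern that violates the limit fails to compile, so preg_match() returns false, and the @ operator just hides the resulting warning.

```php
<?php
// 8-bit non-UTF mode: values must be below 0x100.
$ok     = @preg_match('/\x{ff}/', "\xFF");        // 1
$tooBig = @preg_match('/\x{100}/', "x");          // false (compile error)

// 8-bit UTF-8 mode: U+0100 is accepted and is encoded as the bytes C4 80.
$utfOk = @preg_match('/\x{100}/u', "\xC4\x80");   // 1

// Surrogate code points are rejected in UTF mode.
$surrogate = @preg_match('/\x{d800}/u', "x");     // false (compile error)
```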
Offset
All offsets supplied and returned are counted in data units:
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a
length in length, and a starting offset in startoffset. The units for
length and startoffset are bytes for the 8-bit library, 16-bit data
items for the 16-bit library, and 32-bit data items for the 32-bit
library.
How pcre_exec() returns captured substrings
[...]
When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character
after the end of a substring. These values are always data unit offsets,
even in UTF mode.
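On the PHP side this shows up as byte offsets in both directions. Here is a short sketch, with a sample string of my own, using PREG_OFFSET_CAPTURE for the returned offsets and the fifth argument of preg_match for the supplied one.

```php
<?php
$subject = "a\xC3\xA9bc"; // "b" has code point index 2 but byte offset 3

// Returned offsets are byte offsets, with or without the u flag.
preg_match('/b/', $subject, $raw, PREG_OFFSET_CAPTURE);
preg_match('/b/u', $subject, $utf, PREG_OFFSET_CAPTURE);
// $raw[0][1] and $utf[0][1] are both 3

// The supplied offset is in bytes as well: starting at byte 3 lets \G match "bc".
$found = preg_match('/\Gbc/u', $subject, $m, 0, 3); // 1
```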