I've been looking at the man page for pcre2, trying to figure out precisely which situations require which definitions of PCRE2_CODE_UNIT_WIDTH.
The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit code units, which means that up to three separate libraries may be installed.
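For context, the basic usage I have in mind looks like the sketch below, assuming the 8-bit library (libpcre2-8) is the one that handles plain char* subjects:

```c
/* Minimal sketch: using the 8-bit PCRE2 API for char* data.
   Assumption: build and link against libpcre2-8 (cc example.c -lpcre2-8). */
#define PCRE2_CODE_UNIT_WIDTH 8   /* must be defined before including pcre2.h */
#include <pcre2.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* With width 8, PCRE2_SPTR is const PCRE2_UCHAR8* (an 8-bit type),
       so char* strings only need a cast. */
    PCRE2_SPTR pattern = (PCRE2_SPTR)"ab+c";
    PCRE2_SPTR subject = (PCRE2_SPTR)"xxabbbcxx";

    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
                                   &errcode, &erroffset, NULL);
    if (re == NULL) return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    /* Length is given in code units, i.e. bytes for the 8-bit library. */
    int rc = pcre2_match(re, subject, (PCRE2_SIZE)strlen((const char *)subject),
                         0, 0, md, NULL);
    printf("match rc = %d\n", rc);

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}
```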
Question 1: What exactly is a PCRE2 code unit? Does this mean I need PCRE2_CODE_UNIT_WIDTH 8 to handle char* strings and PCRE2_CODE_UNIT_WIDTH 32 for wchar_t* strings? What if my platform's wchar_t is 16-bit: would that require conditionally using PCRE2_CODE_UNIT_WIDTH 16? If so, it seems that, per "How big is wchar_t with GCC?", I would effectively need to define PCRE2_CODE_UNIT_WIDTH as 8 * __SIZEOF_WCHAR_T__.
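Concretely, I imagine that would look something like this sketch (my assumption, relying on GCC/Clang predefining __SIZEOF_WCHAR_T__ in bytes, and on the matching libpcre2-16 or libpcre2-32 library being installed and linked):

```c
/* Sketch: choosing PCRE2_CODE_UNIT_WIDTH to match this platform's wchar_t. */
#if defined(__SIZEOF_WCHAR_T__) && __SIZEOF_WCHAR_T__ == 2
#  define PCRE2_CODE_UNIT_WIDTH 16   /* e.g. 16-bit wchar_t, as on Windows */
#elif defined(__SIZEOF_WCHAR_T__) && __SIZEOF_WCHAR_T__ == 4
#  define PCRE2_CODE_UNIT_WIDTH 32   /* e.g. 32-bit wchar_t, as on most Unix-like systems */
#else
#  error "unexpected wchar_t size"
#endif
#include <pcre2.h>

/* With width 16, PCRE2_SPTR is const PCRE2_UCHAR16* (16-bit); with width 32
   it is const PCRE2_UCHAR32*, so wchar_t* subjects need a cast either way. */
```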
On the topic of Unicode:
In all three cases, strings can be interpreted either as one character per code unit, or as UTF-encoded Unicode, with support for Unicode general category properties. Unicode support is optional at build time (but is the default). However, processing strings as UTF code units must be enabled explicitly at run time.
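As far as I can tell, the "enabled explicitly at run time" part refers to passing the PCRE2_UTF option to pcre2_compile (or putting (*UTF) at the start of the pattern). A sketch of my understanding, again assuming the 8-bit library and UTF-8 data:

```c
/* Sketch: asking the 8-bit library to treat code units as UTF-8 by passing
   PCRE2_UTF at run time (assumes the library was built with Unicode support). */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

pcre2_code *compile_utf8(const char *pattern)
{
    int errcode;
    PCRE2_SIZE erroffset;
    /* PCRE2_UTF tells pcre2_compile to interpret the pattern (and later the
       subject) as UTF-8 rather than one character per byte. */
    return pcre2_compile((PCRE2_SPTR)pattern, PCRE2_ZERO_TERMINATED,
                         PCRE2_UTF, &errcode, &erroffset, NULL);
}
```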
Question 2: What exactly does PCRE2_CODE_UNIT_WIDTH mean when Unicode is enabled? Does PCRE2_CODE_UNIT_WIDTH 8 take UTF-8, and do I need to set PCRE2_CODE_UNIT_WIDTH 16 to handle a UTF-16 string?
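In other words, is something like the following sketch the intended way to handle UTF-16 input: build against the 16-bit library and set PCRE2_UTF? This is only my assumption; it uses C11 u"" string literals and would link against libpcre2-16.

```c
/* Sketch of my assumption for Question 2: 16-bit code units plus PCRE2_UTF
   means the strings are interpreted as UTF-16. Link with -lpcre2-16. */
#define PCRE2_CODE_UNIT_WIDTH 16
#include <pcre2.h>

int match_utf16(void)
{
    /* C11 u"" literals produce 16-bit code units, which I assume line up
       with PCRE2_UCHAR16 on my platform. */
    PCRE2_SPTR pattern = (PCRE2_SPTR)u"\\p{L}+";
    PCRE2_SPTR subject = (PCRE2_SPTR)u"h\u00e9llo";

    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF, &errcode, &erroffset, NULL);
    if (re == NULL) return -1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return rc;   /* > 0 on match, PCRE2_ERROR_NOMATCH otherwise */
}
```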