
I've been looking at the man page for pcre2, and trying to figure out precisely what situations require which definitions of PCRE2_CODE_UNIT_WIDTH.

The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or 32-bit code units, which means that up to three separate libraries may be installed.

Question 1: What exactly is PCRE2's code unit? Does this mean I need to use PCRE2_CODE_UNIT_WIDTH 8 to handle char * and PCRE2_CODE_UNIT_WIDTH 32 to handle wchar_t *? What if my platform's wchar_t is 16-bit? Would that require conditionally using PCRE2_CODE_UNIT_WIDTH 16? If so, then according to "How big is wchar_t with GCC?" it seems I would need to define PCRE2_CODE_UNIT_WIDTH as 8 * __SIZEOF_WCHAR_T__.

On the topic of Unicode:

In all three cases, strings can be interpreted either as one character per code unit, or as UTF-encoded Unicode, with support for Unicode general category properties. Unicode support is optional at build time (but is the default). However, processing strings as UTF code units must be enabled explicitly at run time.

Question 2: What exactly does PCRE2_CODE_UNIT_WIDTH mean when Unicode is enabled? Does PCRE2_CODE_UNIT_WIDTH 8 take UTF-8, and I need to set PCRE2_CODE_UNIT_WIDTH 16 to handle a UTF-16 string?

Jimmy

1 Answer


What exactly is PCRE2's code unit?

Here's what PCRE2 uses for its code unit definitions (in pcre2.h):

/* Types for code units in patterns and subject strings. */

typedef uint8_t  PCRE2_UCHAR8;
typedef uint16_t PCRE2_UCHAR16;
typedef uint32_t PCRE2_UCHAR32;

typedef const PCRE2_UCHAR8  *PCRE2_SPTR8;
typedef const PCRE2_UCHAR16 *PCRE2_SPTR16;
typedef const PCRE2_UCHAR32 *PCRE2_SPTR32;

So you can see that PCRE2 uses uintX_t under the hood instead of char/wchar_t.

Note that when you define PCRE2_CODE_UNIT_WIDTH to 8, 16 or 32, PCRE2_UCHAR and PCRE2_SPTR will be #defined to the correct variant.

So PCRE2_CODE_UNIT_WIDTH = 8 * __SIZEOF_WCHAR_T__ seems reasonable at first glance, but wchar_t is a poor fit for Unicode data: its width is implementation-defined (16 bits on Windows, 32 bits on most Unix-like systems) and it is not tied to any particular encoding. Avoid it if you want to write portable code, and just use char/uint8_t for UTF-8, uint16_t for UTF-16 and uint32_t for UTF-32.

Don't confuse code units with code points, as several code units can be required to encode a single code point.
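As a rough illustration of the difference, the sketch below counts how many code units one code point needs in each encoding. The function names are made up for this example, and it assumes the input is a valid Unicode scalar value (it returns 0 for surrogates and for values above U+10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is a valid Unicode scalar value, 0 otherwise. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 8-bit code units (bytes) needed to encode cp in UTF-8. */
size_t utf8_units(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 16-bit code units needed to encode cp in UTF-16. */
size_t utf16_units(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    return cp < 0x10000 ? 1 : 2;   /* above the BMP: a surrogate pair */
}

/* Number of 32-bit code units needed to encode cp in UTF-32. */
size_t utf32_units(uint32_t cp)
{
    return is_scalar_value(cp) ? 1 : 0;
}
```

U+00E9 (é) takes 2 code units in UTF-8 but 1 in UTF-16 and UTF-32, while U+1F600 takes 4, 2 and 1 respectively.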

What exactly does PCRE2_CODE_UNIT_WIDTH mean when Unicode is enabled? Does PCRE2_CODE_UNIT_WIDTH 8 take UTF-8, and I need to set PCRE2_CODE_UNIT_WIDTH 16 to handle a UTF-16 string?

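Yes, provided you pass the PCRE2_UTF option when compiling the pattern; without it, each code unit is treated as one character. You can also set PCRE2_CODE_UNIT_WIDTH to 0 if you need to handle several encodings in the same program. You'll lose the generic aliases like pcre2_match, and you'll have to call the width-specific functions such as pcre2_match_8 or pcre2_match_16 instead.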

Lucas Trzesniewski